THE HUMAN GENOME PROJECT: HOW PRIVATE
SECTOR DEVELOPMENTS AFFECT
THE GOVERNMENT PROGRAM


WEDNESDAY,  JUNE  17,  1998 

House  of  Representatives, 

Committee  on  Science, 
Subcommittee  on  Energy  and  Environment, 

Washington,  DC. 

The  Subcommittee  met,  pursuant  to  notice,  at  1:05  p.m.,  in  room 
2318,  Rayburn  House  Office  Building,  Hon.  Ken  Calvert,  Chairman 
of  the  Subcommittee,  presiding. 

Chairman  Calvert.  This  hearing  of  the  Energy  and  Environ- 
ment Subcommittee  will  come  to  order. 

Today  we  will  review  a  program  whose  success  will  have  pro- 
found importance  for  medical  science  for  the  21st  Century.  Some  of 
our  witnesses  today  have  used  some  strong  language  in  describing 
the  value  of  the  human  genome  project,  but  it's  hard  to  exaggerate 
the  importance  of  a  program  that  could  lead  to  prevention,  and 
even  cures,  to  some  of  the  most  serious  diseases  that  afflict  us.  The 
sequencing  of  the  human  genome  began  in  the  mid-1980's  as  an  ef- 
fort by  the  Department  of  Energy  (DOE)  to  study  the  effects  of  ra- 
diation on  the  survivors  of  Hiroshima  and  Nagasaki.  However,  it 
became  an  international  program  with  much  broader  implications 
and  our  federal  program  is  jointly  run  by  DOE  and  the  National 
Institutes  of  Health.  As  the  15-year,  $3  billion  federal  program 
reached  its  halfway  point  this  year,  the  scientific  world  was 
stunned on May 9th when one of the country's foremost genetic scientists,
Dr. Craig Venter, and the Perkin-Elmer Corporation announced
they would form a new venture to, as they put it, "substantially
complete the sequencing of the human genome" in 3
years at one-tenth the cost of the federal program.

Just  how  this  should  affect  the  government  program  is  the  focus 
of  this  hearing  today.  Press  reports  and  some  back  and  forth  be- 
tween critics  and  supporters  of  the  federal  program  have  raised  as 
many  questions  as  it  has  produced  answers.  For  example,  are  the 
goals  of  the  initiative  realistic  or  just  an  optimistic  vision?  Will  this 
private  sector  initiative  duplicate  the  federal  program  and  make  it 
redundant or is it another approach that can complement the federal
program and make it stronger? Is the pace and the cost of the
federal program increased by the bureaucratic nature of any federal
program or does the timetable and cost reflect what is necessary
to do a thorough job? And will the federal program utilize
the latest technology described in the private sector announcement?

Our  witnesses  today,  a  cross-section  of  distinguished  scientists 
from  the  government  and  from  the  private  sectors,  should  be  able 
to  supply,  I  hope,  some  of  the  answers  to  those  questions. 

One  of  the  witnesses  today  warns  that  Congress  is  the  wrong 
forum  in  which  to  debate  the  relative  merits  of  different  scientific 
approaches  to  sequencing  the  human  genome.  Let  me  say  I  couldn't 
agree more. We're not, as my friend George Brown might say, set
up  to  be  a  science  court. 

However,  we  are  given  the  responsibility  of  overseeing  a  federal 
program  that  has  spent  about  $1.9  billion  to  date.  The  purpose  of 
this  hearing  is  to  get  the  best  advice  possible  on  how  to — how  addi- 
tional moneys  should  be  spent. 

I  would  also  like  to  take  a  moment  to  thank  our  witnesses  for 
being  here  today.  Some  of  you  traveled  long  distances  at  your  own 
expense;  others  had  to  rearrange  their  personal  schedules  to  fit 
ours,  and  we  certainly  appreciate  it. 

Before  I  introduce  our  panel,  let  me  turn  to  my  good  friend  from 
Indiana,  the  distinguished  Ranking  Minority  Member,  Mr.  Roemer, 
for  his  opening  remarks. 

Mr.  Roemer.  I  thank  our  distinguished  Chairman  and  want  to 
applaud  him  and  salute  him  for  this  timely  hearing  on  such  a  com- 
plicated, yet  fascinating,  subject.  I  would  ask  unanimous  consent 
that  my  entire  statement  be  entered  into  the  record,  Mr.  Chair- 
man. 

Chairman  Calvert.  Without  objection,  so  ordered. 

Mr.  Roemer.  And  I  will  just  talk  for  a  few  seconds  and  then  yield 
back  the  balance  of  my  time  to  this  expert  panel.  Certainly  we 
have  heard  the  mantra  in  this  Congress  of  faster,  cheaper,  better. 
We  have  heard  promises  at  times  from  the  public  sector,  and  prom- 
ises at  times  from  the  private  sector,  that  appeared  too  good  to  be 
true.  Here  we  have  the  possibility,  a  golden  possibility,  of  a  private- 
public  partnership  that  could  result  in  phenomenal  return  for 
science  and  in  phenomenal  return  for  the  taxpayer.  We  want  to  see 
if  these  promises,  and  if  this  potential,  is  in  fact  true  and  if,  in  fact, 
we  can  do  this  partnership  between  the  public  and  private  sector 
that  some  have  talked  about.  We  want  to  look  at  the  question  of 
privacy  and  patent  issues.  We  want  to  look  at  many  other  serious 
questions  when  it  results  in  cutting  the  costs  as  has  been  talked 
about  in  the  press  by  such  a  significant  degree,  yet  yielding  the 
science  that  we  have  been  talking  about  for  the  last  decade.  So  I'm 
anxious  to  hear  from  our  expert  witnesses.  I'm  very,  very  inter- 
ested in  this  topic  and  we  look  forward  to  our  expert  panel  giving 
us  the  insight  and  the  advice  to  fulfill  the  mantra  of  faster,  cheap- 
er, better,  not  just  with  political  rhetoric  but  with  real  promise  for 
a  private  sector,  public  sector  partnership.  And  with  that,  I  yield 
back  the  balance  of  my  time. 

[The  prepared  statement  of  Mr.  Roemer  follows:] 
I would like to thank the Subcommittee Chairman for his foresight and timely action in
calling  this  hearing.  This  development  is  a  complicated  one,  not  just  in  terms  of  what  it 
will  mean  for  our  federal  programs,  although  that  is  the  most  prominent  question,  but  in 
terms  of  what  it  will  mean  for  our  citizens  and  our  international  relationships. 

In  these  times  of  balanced  budgets,  tobacco  settlements,  and  huge  international  projects, 
the 105th Congress has readily embraced the "faster, better, cheaper" mantra. Often, but
not always, for very good reasons. This pattern seems to be holding as we address the
decision made by Craig Venter and the Perkin-Elmer Corporation to form a new company
that claims it will complete the sequence of the entire genome in 3 years at about 1/10 the
cost of the Federal Human Genome Project.

This  development  has  raised  the  question  of  whether  or  not  we  in  Congress  should  scale 
back  our  federal  programs  based  simply  on  the  promise  of  respected  and  experienced 
scientists and an equally respected and established private corporation. The purpose of this
hearing  is  to  determine  if  that  line  of  thinking  is  premature. 


At this point, I am more concerned with the inevitable changes that will occur as the
mission shifts from public interest to private profit. While I do not discount the sentiment
and motive behind the search for this life-saving knowledge, I think that it is only right to
address the possible pitfalls of private-sector control of this genetic information.
Commercialization can promote the availability of new treatments, but it can also stifle
discovery and innovation. Also, issues of privacy have never been fully addressed. The
complexity  of  these  issues  should  not  be  underestimated  and  an  appropriate  balance  must 
be  struck. 

So  I  thank  you  again  Mr.  Calvert  and  I  welcome  our  witnesses.  I  hope  that  they  will  be 
able  to  shed  some  light  on  how  the  involved  parties  might  form  a  symbiotic  relationship 
between the Federal Human Genome Project and the proposed private-sector project,
and  how  they  plan  to  ensure  that  the  rights  of  the  American  people  are  not  violated  or 
their  needs  exploited. 


Chairman  Calvert.  I  thank  the  gentleman. 

Our first witness is Dr. Ari Patrinos, Associate Director of Energy
Research for the Department of Energy, who oversees the
human genome project for DOE. Dr. Francis Collins is Director of
the National Human Genome Research Institute for the National
Institutes of Health; Dr. Craig Venter is President of the Institute
for Genomic Research in Rockville, Maryland, and is one of the
partners in the private sector initiative announced on May 9th; Dr.
David Galas is President and Chief Executive Officer of CHIRO
Science R&D, Inc. of Washington State. Dr. Galas at one time
served as Director for Health and Environmental Research at the
Department of Energy; and Dr. Maynard Olson is Professor of Medicine
for the Division of Medical Genetics at the University of
Washington.

Gentlemen,  it's  our  policy  to  swear  in  all  witnesses.  So  I  would 
ask  you  to  rise  for  me  please. 

Do  you  solemnly  swear  to  tell  the  truth,  the  whole  truth,  and 
nothing  but  the  truth? 

Mr.  Patrinos.  I  do. 

Dr.  Collins.  I  do. 

Mr.  Venter.  I  do. 

Mr.  Galas.  I  do. 

Mr.  Olson.  I  do. 

Chairman  Calvert.  You're  sworn  in.  Let  the  record  show  that  all 
answered  in  the  affirmative. 

You  may  be  seated. 

Without  objection,  the  full  written  testimony  for  each  of  you  will 
be  included  in  the  record.  I  would  ask  that  each  of  you  summarize 
your  remarks  in  approximately  5  minutes  so  we'll  have  sufficient 
time  for  questions. 

Dr.  Patrinos,  you  may  begin  your  opening  statement. 

TESTIMONY OF ARISTIDES A. PATRINOS, ASSOCIATE DIRECTOR
OF ENERGY RESEARCH FOR BIOLOGICAL AND ENVIRONMENTAL
RESEARCH, U.S. DEPARTMENT OF ENERGY,
WASHINGTON, DC

Mr.  Patrinos.  Thank  you,  Mr.  Chairman,  Mr.  Roemer.  I  am 
pleased  to  testify  before  the  Subcommittee  on  the  future  of  the 
human  genome  project  and,  specifically,  how  the  new  private  sector 
venture will help shape our program. I'm honored to testify along
with  such  a  distinguished  set  of  scientists,  the  gentlemen  to  my 
left.  The  Department  of  Energy  takes  great  pride  in  its  pioneering 
in  the  human  genome  project  that  will  essentially  revolutionize  bi- 
ology and  help  usher  in  a  new  millennium  of  wonderful  applica- 
tions in  medicine,  environmental  bioremediation,  and  sustainable 
development. 

Back  in  1986,  the  Biological  and  Environmental  Research  pro- 
gram that  I  have  the  privilege  of  directing  presently,  while  seeking 
a  molecular  level  understanding  of  the  effects  of  ionizing  radiation 
on  human  biology,  proposed  to  sequence  the  3  billion  base  pairs  of 
human  DNA  and  identify  the  important  genes  on  the  23  pairs  of 
chromosomes. 

It  was  a  proposal  that  at  the  time  was  considered  with,  or  at 
least  was  met  with  considerable  skepticism  and,  I  might  add,  some 


hostility  as  well.  However,  the  rest  is  history,  as  you  know,  and  in 
1990,  along  with  our  colleagues  at  the  National  Institutes  of 
Health, we formally launched the Human Genome Program, along
with a common 5-year plan that we updated in 1993 because of
faster-than-expected progress. As you mentioned, Dr. Galas, who
was  my  predecessor  in  this  job,  was,  in  fact,  in  charge  of  the  DOE 
element  of  the  program  at  that  time.  Last  month  representatives 
of  our  two  agencies  from  the  NIH  and  the  Department  of  Energy 
met  with  key  members  of  the  scientific  community  to  work  out  the 
details  of  the  next  5-year  plan  that  we  expect  to  issue  in  October, 
officially  October  of  this  year,  and  I  expect,  we  expect  that  this 
plan  will  be  coordinated  with  our  international  partners  such  as 
the  Sanger  Center  in  the  United  Kingdom,  as  well  as  with  private 
sector ventures such as the initiative that you made reference to, the
initiative  launched  by  Dr.  Craig  Venter  of  the  Institute  for 
Genomic  Research  and  Perkin-Elmer. 

At  the  midpoint  of  its  projected  15-year  lifetime,  the  human  ge- 
nome program  is  embarking  on  its  high-volume  DNA  sequencing 
phase.  This  has  been  made  possible  because  of  advances  in  se- 
quencing technologies,  because  of  advances  in  informatics  and  also 
because  of  enhanced  access  to  cloned  resources.  The  Department  of 
Energy  has  met  this  challenge  by  creating  the  Joint  Genome  Insti- 
tute and  merging  the  resources  and  capabilities  and  talents  of  our 
three  genome  centers  at  our  laboratories  at  Berkeley,  Los  Alamos, 
and  Livermore.  The  DOE  expects  to  do  its  fair  share  of  high-vol- 
ume DNA  sequencing  at  the  sequencing  factory  that  we  are  estab- 
lishing at  Walnut  Creek,  California. 

From  the  very  beginning  the  human  genome  program  has  fo- 
cused on  developing  technologies  and  resources  that  would  advance 
the  utility  and  science  of  the  information  contained  in  the  human 
genome  and  it  is  in  that  vein  that  we  welcome  the  private  sector 
initiatives  such  as  the  one  announced  by  Dr.  Venter  and  Perkin- 
Elmer.  That  effort  is  particularly  noteworthy  because  it  is  our  un- 
derstanding that  they  will  share  their  data  with  us  promptly,  and 
it  also  comes  at  a  time  when  we  all  collectively  recognize  that  our 
nation  needs  enhanced  sequencing  capacity  so  that  we  can  all  reap 
the  benefits  of  the  human  genome  project  in  terms  of  public  health 
and  medicine. 

Some  of  the  basic  research  that  the  Human  Genome  Program 
has  nurtured,  both  at  The  Institute  of  Genomic  Research  and  else- 
where, laid  the  foundation  for  the  sequencing  approach  that's  been 
proposed  by  the  private  sector  venture.  Such  intellectual  partner- 
ships between  the  public  and  private  programs,  we  believe,  will 
speed  the  completion  of  the  human  genome  project  goals  and  sig- 
nificantly enrich  the  scientific  community  that's  involved  in  the 
project.  As  we  speed  up  the  exploitation  of  the  genomic  informa- 
tion, however,  we  should  be  ever  vigilant  about  the  ethical,  legal, 
and  social  implications  that  we  may  have  to  deal  with.  During  the 
next  few  months  we  will  be  unveiling  the  specifics  of  our  new  5- 
year  plan  that  will  definitely  incorporate  the  new  private  sector 
venture.  The  scientific  community  that  is  involved  in  our  project  is 
on the cutting edge of technology development and scientific discovery,
and I have every confidence that many more surprises await
us  on  the  road  ahead. 


I  believe  that  these  discoveries  will  happen  at  the  interfaces  be- 
tween the  agencies  that  are  involved  in  the  human  genome  project 
such  as  biology,  information  science,  and  engineering,  and  I  think 
that  our  program  and,  from  the  parochial  point  of  view,  our  labora- 
tories, the  DOE  National  Laboratories,  are  ideally  suited  to  con- 
tribute to  the  discoveries  for  the  benefit  of  our  Nation. 

This  completes  my  prepared  remarks  and  I'll  be  ready  to  answer 
any  questions.  Thank  you. 

[The prepared statement and attachments of Mr. Patrinos follow:]
STATEMENT OF

DR.  ARI  PATRINOS 

ASSOCIATE  DIRECTOR 

OFFICE  OF  BIOLOGICAL  AND  ENVIRONMENTAL  RESEARCH 

OFFICE  OF  ENERGY  RESEARCH 

DEPARTMENT  OF  ENERGY 

BEFORE  THE 

COMMITTEE  ON  SCIENCE 

SUBCOMMITTEE  ON  ENERGY  AND  ENVIRONMENT 

UNITED  STATES  HOUSE  OF  REPRESENTATIVES 

JUNE  17,  1998 


Mr.  Chairman  and  Members  of  the  Subcommittee: 

I  am  pleased  to  testify  before  the  Subcommittee  on  the  future  of  the  Human  Genome  Project 
(HGP).  The  Department  of  Energy  (DOE)  takes  great  pride  in  its  role  in  this  important  research 
endeavor  that  will  revolutionize  the  field  of  biology  and  help  usher  in  a  new  millennium  of 
wonderful  applications  in  the  fields  of  medicine,  environmental  remediation,  and  sustainable 
development. 

The  DOE  Biological  and  Environmental  Research  (BER)  program  launched  a  pilot  project  in 
1986  to  examine  the  feasibility  of  sequencing  the  three  billion  pairs  of  human  DNA  and  to 
identify  all  the  genes  on  our  twenty-three  pairs  of  chromosomes.  One  of  the  initial  objectives  of 
the  BER  project  was  to  seek  a  molecular-level  understanding  of  the  effects  of  ionizing  radiation 
on  human  biology,  a  goal  that  continues  today.  The  National  Institutes  of  Health  (NIH),  having 
started  its  own  program  in  1988,  joined  DOE  in  the  formal  launch  of  the  HGP  in  1990  and 
together  the  two  agencies  issued  a  five-year  research  plan.  In  1993,  that  plan  was  updated  two 
years  ahead  of  schedule,  due  to  faster  than  expected  progress;  most  notably,  rapid  progress  came 
from advances in physical mapping and in technology, and simultaneously from the unexpected
pace  of  disease  gene  discovery  that  dramatically  demonstrated  the  value  of  genome-scale 
research.  Last  month,  representatives  from  the  two  agencies  met  with  key  members  of  the 
scientific  community  to  agree  on  the  details  of  the  next  five-year  plan  that  will  be  released  in 
October 1998. The plan will be coordinated with those of our international partners (e.g., with
the  United  Kingdom's  Sanger  Center)  as  well  as  with  parallel  private  sector  initiatives  such  as  the 
recently announced venture by Perkin-Elmer and Dr. Craig Venter of The Institute for Genomic
Research  (PE-TIGR). 

At  the  midpoint  of  its  projected  15-year  lifetime,  following  achievement  of  every  milestone  of 
the 1993 plan on or ahead of schedule, the HGP is embarking on the task of high volume human
DNA sequencing in order to deliver the highly accurate sequence of an entire generic human
genome by 2005; the task has been made possible by advances in sequencing and information
technologies and in enhanced access to clone resources. The DOE has responded to the new
challenges of this phase of the HGP by creating the DOE Joint Genome Institute (JGI), the
combination  of  the  DOE  genome  research  centers  at  Los  Alamos,  Lawrence  Berkeley,  and 
Lawrence  Livermore  National  Laboratories.  The  Institute  will  undertake  the  DOE's  share  of  high 
volume  sequencing  at  its  new  production  sequencing  facility  in  Walnut  Creek,  California. 

The  new  five-year  plan  will  describe  the  details  of  the  public  sector  sequencing  strategy  as  well 
as the other elements of the HGP. In addition to the pursuit of a complete map of the human
genome,  these  elements  include:  the  further  development  of  sequencing  technologies  that  will  be 
needed  to  use  information  being  generated  in  the  HGP  long  after  the  first  human  sequence  is 
completed  in  2005;  the  creation  of  the  data  bases  that  will  accept  and  process  the  large  amounts 
of  data  generated  by  sequencing;  the  sequencing  of  genomes  of  model  organisms  to  help  us 
understand,  most  efficiently  and  cost  effectively,  the  human  genome;  the  ethical,  legal,  and  social 
implications  (ELSI)  of  the  HGP;  and  the  pursuit  of  some  of  the  biological  applications  that  will 
be  enabled  by  the  completion  of  the  first  reference  or  generic  genome  sequence,  a  sequence 
comprised of DNA from ten women and ten men who will be rigorously anonymous and whose
informed  consent  will  have  been  fully  assured. 

Progress in the HGP itself, together with scientific contributions from the many HGP spinoffs in
both  the  public  and  private  sector,  will  enable  us  to  include  new  program  goals  that  could  not 
have  been  anticipated  only  a  few  years  ago.  These  unexpected  new  goals  are  consistent  with  the 
history  of  the  HGP  making  bigger  payoffs  and  providing  even  greater  value  than  anticipated, 
both  scientific  and  economic.  Advances  in  technology  will  enable  the  efficient  characterization 
of  the  biological  functional  units  in  every  cell,  the  gene  transcripts  and  their  protein  products. 
Moreover,  rapid  progress  in  determining  the  genomic  sequences  of  model  organisms  such  as 
yeast (the first yeast genome was completed in 1996), the worm, C. elegans, (scheduled for
completion  in  1998),  and  a  rapidly  increasing  number  of  microbes  is  enabling  more  rapid 
characterization  and  discovery  of  human  genes  than  previously  expected.  Progress  in  meeting 
the  sequencing  and  biological  goals  of  the  HGP  will  also  challenge  the  ELSI  component  of  the 
HGP  to  address,  more  quickly,  the  critical  issues  arising  from  the  unexpectedly  rapid  availability 
and  use  of  human  genome  information. 

From  the  beginning,  the  HGP  has  been  focused  on  developing  technologies  and  resources  that 
would  advance  the  science  and  utility  of  the  information  contained  in  the  human  genome.  Thus, 
DOE  welcomes  private  sector  initiatives  such  as  the  PE-TIGR  venture  that  will  add  value  to  the 
public  sector  effort.  This  private  sector  effort  is  particularly  noteworthy  since  it  is  our 
understanding  that  PE-TIGR  intends  to  share  its  data  promptly  with  the  HGP,  and  since  it  comes 
at a time when there is an increased need for sequencing capacity if the Nation is to realize fully
the  public  health  and  medical  benefits  of  the  genome  project  as  quickly  as  possible. 

It  is  notable  that  NIH-  and  DOE-funded  basic  research  (at  TIGR  and  elsewhere)  laid  the 
foundation  for  the  sequencing  approach  being  proposed  by  PE-TIGR.  We  do  believe  that  such 
emerging  public-private  intellectual  partnerships  will  speed  completion  of  some  HGP  goals  and 
enrich  the  scientific  community  involved  in  the  HGP.  However,  at  the  same  time,  it  is  important 
that  we  work  to  guarantee  that  HGP  data  acquired  with  public  funds  continue  to  be  made 
available  to  the  scientific  community  at  large  and  that  the  data  is  of  a  quality  that  provides  the 
greatest  scientific  information  and  utility.  The  product  of  the  PE-TIGR  venture  will  contain 
many  gaps,  whereas  the  HGP  has  always  been  committed  to  a  contiguous,  high  quality,  highly 
accurate,  complete  sequence.  Moreover,  there  is  a  critical  need  for  increased  sequencing 
capacity  within  our  academic  and  national  laboratories  to  meet  the  many  public  sector 
sequencing  demands  that  will  follow  the  HGP.  This  information  will  be  revealed  by  sequencing 
the  genomes  of  model  organisms,  such  as  mice,  rats,  and  primates  for  which  we  have  a  rapidly 
growing  wealth  of  biological  information  that  provides  insight  into  how  human  genes  function. 
In addition, sequence information from portions of the genomes of hundreds of individuals will
be  needed  to  understand  human  genetic  variation  and  will  serve  as  the  basis  for  developing 
individual-specific  diagnosis  and  therapy,  a  potential  focus  of  21st  Century  medicine. 

The  scientific  community  involved  in  the  HGP  is  truly  on  the  cutting  edge  of  technology 
development  and  scientific  discovery;  and  as  a  result,  surprising  new  discoveries  and  advances 
can be expected over the next few years. Many of these discoveries will occur at the interfaces
of  the  sciences  that  are  involved  in  the  HGP  such  as  biology,  information  science,  and 
engineering. The multidisciplinary capabilities of our national laboratories are ideally suited to
contribute  to  these  discoveries.  Together  with  our  NIH  partners  we  strive  to  facilitate  these 
discoveries  and  advances  for  the  benefit  of  the  Nation. 

This  completes  my  prepared  testimony.  I  would  be  happy  to  answer  your  questions. 
Ari Patrinos


Dr. Patrinos received a diploma in mechanical and electrical engineering from the
National  Technical  University  of  Athens  and  a  PhD  in  mechanical  engineering  and 
astronautical  sciences  from  Northwestern  University.  His  research  included 
atmospheric  turbulence,  computational  fluid  dynamics,  and  hydrodynamic  stability. 
After  a  year  on  the  faculty  of  the  University  of  Rochester  he  joined  Oak  Ridge  National 
Laboratory  in  1976  to  conduct  research  on  energy-related  weather  and  climate 
modification and to develop numerical codes for loss-of-coolant (LOC) nuclear accident
simulations as well as for river flows and lake circulations.

In  1980,  he  joined  Brookhaven  National  Laboratory  to  develop  atmospheric  chemistry 
models  and  to  lead  field  programs  on  wetfall  chemistry.  In  1984,  he  was  detailed  to 
EPA and to the National Acid Precipitation Assessment Program (NAPAP) staff in
Washington,  DC.  He  joined  DOE  in  1986,  restructuring  the  Department's  atmospheric 
sciences program, and in 1988 led the expansion of DOE's research effort in global
environmental change. He was the director of the Atmospheric and Climate Research
Division (ACRD) of DOE's Office of Biological and Environmental Research (OBER)
until 1990. When ACRD was merged with OBER's Ecological Research Division, he
became director of the combined Environmental Sciences Division.

From  August  1993  until  March  1995,  Dr.  Patrinos  was  acting  as  the  Associate  Director 
for  Biological  and  Environmental  Research  in  the  Office  of  Energy  Research;  since 
March  1995  he  has  been  the  Associate  Director,  who  oversees  the  research  activities 
including  the  DOE  human  and  microbial  genome  programs,  structural  biology,  nuclear 
medicine and health effects, global environmental change, and basic research
underpinning DOE's environmental restoration effort. Dr. Patrinos represents DOE on
several subcommittees of the Committee on Environment and Natural Resources of the
National Science and Technology Council. He is a member of the American Society of
Mechanical  Engineers,  the  American  Geophysical  Union,  the  American  Meteorological 
Society,  and  the  Greek  Technical  Society. 
Chairman Calvert. Dr. Collins.

TESTIMONY  OF  FRANCIS  S.  COLLINS,  M.D.,  DIRECTOR,  NA- 
TIONAL HUMAN  GENOME  RESEARCH  INSTITUTE,  NATIONAL 
INSTITUTES  OF  HEALTH,  U.S.  DEPARTMENT  OF  HEALTH  AND 
HUMAN  SERVICES,  BETHESDA,  MD 

Dr.  Collins.  Thank  you  very  much,  Mr.  Chairman.  I  am  honored 
to  appear  before  this  Committee,  especially  with  the  distinguished 
folks  sitting  at  the  table  with  me.  I  am  Director  of  the  National 
Human  Genome  Research  Institute  which  is  the  part  of  the  Na- 
tional Institutes  of  Health  which  is  devoted  to  the  human  genome 
project,  one  of  22  such  institutes  and  centers  of  the  NIH. 

In  case  you  are  not  familiar  with  the  NIH's  means  of  funding 
science,  let  me  just  quickly  point  out  that  the  funding  that  we  give 
to  the  Human  Genome  Project  is  derived  from  grant  applications 
which  we  get  from  investigators  at  universities,  institutes  and 
some  companies  around  the  country.  They  send  in  their  grant  pro- 
posals to  us.  Those  are  peer  reviewed  and  then  we  select  the  ones 
that  we  think  are  the  most  meritorious  for  funding.  Regrettably  at 
the  present  time,  only  about  one  in  four  approved  applications  is 
funded  but  that  is  where  the  work  of  the  NIH  component  of  the  ge- 
nome project  is  done,  out  there  in  academia,  in  small  companies, 
and  in  institutes. 

I  wanted  to  make  four  points  in  my  brief  opening  statement 
which  are  taken  from  the  written  remarks  which  are  more  exten- 
sive. First  of  all,  Mr.  Chairman,  you  pointed  out  that  there  have 
been  bold  words  spoken  about  the  genome  project.  Let  me  speak  a 
couple  of  them  myself.  As  a  physician  and  a  scientist,  I  do  believe 
that  genetics  has  become  the  core  science  of  medicine.  Whatever 
disease  you're  interested  in  understanding,  genetics  is  now  the 
most  powerful  tool  you  have  to  get  at  the  mysteries  that  still  re- 
main unlocked.  I  also  believe  that  the  genome  project  has  become 
the  center  of  genetics,  this  effort  to  map  and  sequence  all  the  DNA 
of  the  human  and  other  model  organisms  is  very  much  the  focal 
point of the modern revolution. So what we are talking about today
is  the  core  of  the  core.  Its  importance  can  hardly  be  overstated.  I 
do  believe  historians  will  look  at  this  as  the  most  ambitious  and 
important  organized  scientific  effort  that  humankind  has  mounted, 
including  splitting  the  atom  or  going  to  the  moon,  because  this  is 
an  investigation  into  ourselves. 

Second  point:  The  genome  project  has  been  characterized  by  a 
complex,  but  carefully  planned,  agenda  since  the  outset.  There  has 
been  some  misunderstanding  I  believe,  and  perhaps  recently  espe- 
cially in  the  press,  about  what  the  genome  project  aims  to  do.  This 
is  not  just  a  project  to  sequence  human  DNA.  In  its  first  several 
years,  many  of  the  goals  of  the  project  related  to  developing  maps, 
genetic  maps  and  physical  maps,  as  well  as  improving  the  tech- 
nologies in  order  to  be  able  to  afford  to  do  the  human  sequencing 
at  the  pace  that  was  needed  to  complete  the  job  at  the  cost  that 
was  estimated  to  be  available.  So  up  until  now,  in  fact,  only  a 
minor  fraction  of  the  budget  of  the  human  genome  project  has  been 
devoted  to  the  actual  human  sequencing,  the  part  that  is  now 
ramping  up  in  a  major  way  with  10  percent  of  that  now  available 
in  public  database  in  assembled  or  partially  assembled  form. 
There is also an emphasis on model organisms which has taught
us  much  about  how  genetics  predicts  a  particular  kind  of  pheno- 
type  and  which  will  serve  us  well  in  trying  to  understand  what  the 
human  DNA  sequence  means.  And  there  is  our  ELSI  program 
which  Dr.  Patrinos  has  already  mentioned,  looking  at  the  ethical, 
legal,  and  social  implications  of  this  research.  So  the  genome 
project  is  much  broader  than  just  the  human  sequence.  When  we 
look  at  cost  comparisons,  for  instance,  of  this  approach  versus  that 
approach,  it  would  be  important  to  be  sure  we  are  talking  about 
the  same  activities. 

Third  point:  The  genome  project  up  until  now  is  arguably  one  of 
the  more  impressive  success  stories  of  the  federal  investment  in 
science of all time. Every milestone that has been put forward by
carefully chosen advisers outside the government has been
achieved or exceeded. The cost that has gone into this project is
roughly  25  percent  less  in  its  first  half  than  was  expected  by  the 
original  planners,  so  it  is  fair  to  say  the  project  has  been  faster, 
better,  and  cheaper  up  until  now  and  we  aim  to  maintain  that 
record. 

As  a  physician  I  can  tell  you  the  consequences  of  this  project  are 
all  around  us.  Back  in  the  1980's,  when  I  was  on  the  faculty  at  the 
University  of  Michigan,  I  spent  almost  10  years  finally  identifying 
the  cystic  fibrosis  gene  and  another  roughly  10  years  participating 
in  a  group  that  found  the  Huntington's  disease  gene.  That  was  the 
best  you  could  do  in  the  1980's.  Nowadays,  it's  a  matter  of  months. 
Just  a  few  months  ago,  a  gene  for  Parkinson's  disease  was  found, 
using  the  tools  of  the  genome  project,  in  9  months,  and  breaking 
open  research  in  that  field  which  has  really  been  frustrating  for  30 
years.  So  this  is  a  success  already.  You  don't  have  to  wait  until  the 
sequence  is  in  hand  to  see  it  happen. 

Fourth  point:  Partnership  with  the  private  sector  is  both  nec- 
essary and  desirable  and  we  welcome  this  new  initiative  which  is 
being  discussed  today  by  Dr.  Venter.  In  fact,  such  public/private 
partnerships  have  characterized  the  genome  project  from  the  out- 
set. There  are  many  other  examples  of  that  sort,  though  perhaps 
none  as  bold  as  this  one.  Again,  we  need  to  look  carefully  at  the 
ways  in  which  this  private  initiative  and  the  publicly-funded  effort 
can  be  complementary  and  we  also  need  to  consider  scientifically 
the  ways  that  the  strategy  is  different,  which  actually  adds  to  the 
complementarity. And I know Dr. Olson will particularly comment
upon  that  in  his  remarks. 

Let  me  assure  you,  we  will  work  together.  If  you  doubt  that,  no- 
tice that  Dr.  Venter  and  I  seem  to  have  worn  the  same  clothes 
today  without  intending  to.  We  are  intending  to  be  partners  in  this 
in  every  possible  way,  so  let  this  be  a  symbol  thereof. 

This  is  not  a  race.  We  will  work  together,  we  believe  in  the  value 
of that, we believe we have complementary strategies. The federal
effort  is  fully  prepared  to  adjust  their  strategy.  As  we  move  for- 
ward we  have  a  vigorous  advisory  process  to  do  that,  constituted 
by  some  of  the  world's  best  scientists.  We  have  adjusted  our  strat- 
egy on  a  regular  basis,  based  on  technological  developments,  but  I 
would argue that it's a little soon to know exactly what that adjustment
should be. As Dr. Venter will tell you, the proposal which has
been  put  forward  is  bold,  but  is  yet  untried,  and  the  quality  of  the 
product, a very serious question because we do believe we want the
whole  genome  sequence  with  as  few  gaps  as  possible,  as  few  mis- 
takes as  possible,  the  quality  is  so  important  that  one  must  not,  I 
think,  deviate  from  that  goal  or  from  the  strategy  to  get  there  until 
we  have  the  data  in  front  of  us  to  see  how  this  new  approach  will 
work. 

In  that  regard,  we  welcome  a  proposal  by  Dr.  Venter  to  try  out, 
as  a  pilot  effort,  the  genome  sequence  of  the  fruitfly  Drosophila. 
This  effort,  which  will  get  under  way  in  about  6  months,  focuses 
on  an  organism  whose  genome  is  30  times  smaller,  and  much  more 
tractable  and  I  believe  we  will  learn  a  lot  from  that  pilot  effort 
about  the  ways  in  which  this  strategy  can  be  applied  to  the  human. 
At  that  point  it  will  be  easier,  perhaps,  for  the  federal  effort  to 
make  some  predictions  about  ways  that  we  might  adjust  our  strat- 
egy. 

But  to  summarize,  we  welcome  this  development,  we  believe  that 
we  have  a  good  track  record  of  working  together  with  the  private 
sector, and I look forward to seeing these two complementary efforts
get us there soon, which is my goal, and should be yours.

[The  prepared  statement  and  attachments  of  Dr.  Collins  follow:] 
I am Dr. Francis Collins, Director of the National Human Genome Research Institute
(NHGRI) of the National Institutes of Health. I appreciate the opportunity to appear before the
Subcommittee  today  to  discuss  the  Human  Genome  Project  and  the  implications  of  the  recent 
announcement  by  a  private  company  of  their  intentions  to  carry  out  large-scale  sequencing  of  the 
human  genome. 

The  NHGRI  is  one  of  the  22  Institutes  and  Centers  that  comprise  the  federation  of  federal 
research  entities  known  as  the  National  Institutes  of  Health  (NIH).  The  vast  majority  of  research 
dollars  appropriated  to  the  NIH  flow  out  to  the  scientific  community  across  the  Nation,  primarily 
in  the  form  of  peer-reviewed  research  grants.  Today,  that  community  numbers  more  than  50,000 
investigators  affiliated  with  nearly  2,000  universities,  hospitals,  and  other  research  facilities 
located  in  all  50  states,  the  District  of  Columbia,  Puerto  Rico,  Guam,  the  Virgin  Islands,  and 
certain  points  abroad. 

The  NHGRI  is  the  lead  Institute  at  the  NIH  with  responsibility  for  The  Human  Genome 
Project  (HGP).  The  HGP  officially  began  in  October  of  1990  as  a  15-year  program  to 
characterize  in  detail  the  complete  set  of  human  genetic  instructions  (the  "genome").  The  central 
aim  of  the  project,  which  the  federal  government  funds  through  programs  at  the  NIH's  National 
Human  Genome  Research  Institute  and  the  Department  of  Energy,  is  to  arm  health  researchers 
with  powerful  gene-finding  and  DNA  analysis  tools  to  unravel  and  understand  the  myriad  human 
diseases  that  have  their  roots  in  DNA.  Now  at  its  half-way  mark,  genome  project  tools  have 
underpinned  virtually  all  gene  discoveries  of  this  decade. 

The Human Genome Project's success stems largely from a unique and rigorous planning
process  that  sets  ambitious  research  goals,  time  lines  and  budgets.  The  first  joint  NIH/DOE  plan, 
which  covered  years  1991-1995,  included  goals  for: 

►  physical  and  genetic  maps; 

► experimental DNA sequencing of the fruit fly, a roundworm, yeast, and the bacterium
E. coli;

►  computer  management  of  research  data;  and 

►  studies  of  the  ethical,  legal,  and  social  implications  (ELSI)  of  these  new  abilities  to  read 
genetic information.

Because  of  the  rapid  pace  of  genome  research  and  technology  development,  scientists  met 
many  of  those  initial  goals  ahead  of  schedule  and  under  budget.  So  the  research  plan  was 
updated  again  in  1993  to  establish  new  NIH-DOE  goals  through  1998.  All  of  these  goals  have 
now been met or exceeded. Original expectations were that the NIH cost of these activities from
FY'91-97  would  exceed  $1  billion  in  1991  dollars.  I  am  pleased  to  report  that  the  cost  has  been 
about  25  percent  less  than  that  projection. 
Gene Discovery

Today,  with  Human  Genome  Project  tools,  it  is  possible  to  track  down  a  disease-related 
gene  even  when  nothing  is  known  about  the  biochemical  problems  of  the  disease  or  how  the  gene 
works.  This  technique,  based  on  identifying  the  position  of  a  gene  in  the  chromosome  and  then 
isolating  it,  is  commonly  referred  to  as  positional  cloning  and  was  successfully  used  for  the  first 
time  in  1986.  Now,  the  increasing  detail  and  quality  of  genome  maps  have  reduced  the  time  it 
takes to find a disease gene from years, to months, to weeks, to sometimes just days, and
scientists  are  using  the  tools  to  discover  dozens  of  disease  genes  each  year. 

An  Example  -  Parkinson's  Disease 

The  isolation  of  a  gene  for  Parkinson's  disease  (PD)  last  year  demonstrated  the  power  of 
this  new  discovery  method  and  showed  conclusively  that  changes  in  DNA  can  cause  PD  in  some 
families.  Only  two  years  ago,  the  National  Institute  of  Neurological  Disorders  and  Stroke  held  a 
workshop  to  explore  using  genetic  approaches  to  understand  PD.  A  team  led  by  scientists  in 
NHGRI's Division of Intramural Research (DIR) began large-scale genetic analysis of DNA from
members of a large Italian family containing almost 600 people, more than 60 of whom have been
diagnosed with Parkinson's. In nine days, NHGRI gene hunters mapped the gene to a region of
chromosome  4,  which  contained  approximately  100  genes.  One  of  the  several  genes  in  that 
interval  had  already  been  identified  on  the  gene  map  and  was  known  to  encode  a  protein  called 
alpha-synuclein. 

In  just  a  few  months,  the  researchers  showed  conclusively  that  an  altered  alpha-synuclein 
gene  caused  Parkinson's  disease  in  the  study  families.  Many  have  hailed  this  as  the  most 
significant  advance  in  Parkinson's  disease  research  in  30  years.  Just  last  month,  a  Japanese 
research team used genome mapping tools to isolate another gene, this time on chromosome 6, that,
when altered, also appears to predispose the individual to a rare juvenile form
of  Parkinson's  disease. 

Ethical,  Legal,  and  Social  Implications 

NHGRI  has  established  productive  partnerships  among  consumers,  scientists,  and  policy 
makers  to  help  reduce  the  possibility  that  genetic  information  will  be  used  to  harm  an  individual  or 
family  members  and  ensure  that  it  will  be  of  benefit  to  both  patients  and  providers.  As  an  integral 
part  of  the  Human  Genome  Project,  the  NHGRI  and  the  DOE  have  each  set  aside  a  portion  of  their 
funding  to  anticipate,  analyze,  and  address  the  ethical,  legal,  and  social  implications  (ELSI)  of  the 
Project's  new  advances  in  human  genetics.  The  current  goals  of  the  ELSI  program  are  to  improve 
the  understanding  of  these  issues  through  research  and  education,  to  stimulate  informed  public 
discussion,  and  to  develop  policy  options  intended  to  ensure  that  genetic  information  is  used  for 
the benefit of individuals and society. Because genetic information is personal, powerful, and
potentially  predictive,  it  can  be  used  to  stigmatize  and  discriminate  against  people.  Genetic 
information  must  be  private. 
DNA Sequencing

If the letters representing the 3 billion bases in the human genome were printed out in
books,  and  the  books  were  stacked  one  on  top  of  the  other,  they  would  reach  as  high  as  the 
Washington  Monument.  The  current  major  goal  of  the  Human  Genome  Project  is  to  read  the  order, 
letter  by  letter,  of  those  3  billion  bases. 

Sequencing was once done by hand as a series of chemical reactions, a slow and costly
method.  In  1990,  when  the  HGP  began,  the  sequencing  cost  was  $10/base.  Now,  because  of 
public  investment  and  collaboration  with  the  private  sector,  machines  read  the  sequence  fragments 
quickly  and  efficiently.  As  a  result,  the  sequencing  cost  has  been  dramatically  reduced  to  roughly 
$0.50/base for high-quality "finished" sequence.
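The two per-base figures quoted above can be turned into a rough whole-genome comparison. The sketch below is only an illustration: it assumes a flat per-base price applied once across all 3 billion bases, which the statement itself does not claim, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the statement.
# Assumption (not in the statement): a flat price per base over the whole genome.
GENOME_BASES = 3_000_000_000   # "3 billion bases"
COST_1990_PER_BASE = 10.00     # dollars per base when the HGP began
COST_1998_PER_BASE = 0.50      # dollars per "finished" base, quoted as "roughly"


def sequencing_cost(bases: int, dollars_per_base: float) -> float:
    """Cost of reading `bases` once at a flat per-base price."""
    return bases * dollars_per_base


cost_1990 = sequencing_cost(GENOME_BASES, COST_1990_PER_BASE)
cost_1998 = sequencing_cost(GENOME_BASES, COST_1998_PER_BASE)
fold_reduction = COST_1990_PER_BASE / COST_1998_PER_BASE

print(f"At 1990 prices: ${cost_1990:,.0f}")
print(f"At 1998 prices: ${cost_1998:,.0f}")
print(f"Per-base cost fell {fold_reduction:.0f}-fold")
```

Under that simplifying assumption the quoted drop from $10 to $0.50 per base is a 20-fold reduction, which is the difference between tens of billions of dollars and roughly one and a half billion for a single pass over the genome.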

Using  a  strategy  referred  to  as  "shotgun"  sequencing,  an  investigator  takes  each  page  of 
those  books  stacked  as  tall  as  the  Washington  Monument,  and  randomly  cuts  the  text  into  small 
fragments.  These  fragments  are  small  enough  for  sequencing  machines  to  read.  To  get  long 
stretches of contiguous DNA, investigators must then reassemble these sequenced fragments back
into  sentences,  paragraphs,  chapters,  and  books.  The  reassembly  of  this  puzzle  is  carried  out 
largely  by  sophisticated  computer  programs. 
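The reassembly step described above can be sketched with a toy program. This is a minimal illustration, not the software used by any sequencing center: it assumes error-free fragments, uses a simple greedy "merge the largest overlap" rule, and the function names and the short example sequence are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining contiguous pieces; more than one piece means
    the fragments left a gap that could not be bridged.
    """
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three overlapping "reads" of a made-up 14-letter sequence, in shuffled order.
reads = ["AGGCTCAT", "GATTACA", "TACAGGC"]
assembled = greedy_assemble(reads)

# Without the middle read there is nothing to join the two ends.
with_gap = greedy_assemble(["GATTACA", "AGGCTCAT"])
```

With all three reads the pieces join into a single stretch; with the middle read missing, two separate pieces remain, which is the kind of gap the statement refers to. Real genomes also contain repeated text, which a simple rule like this one can join incorrectly.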

The  sequencing  strategy  the  public  genome  project  uses  employs  shotgun  sequencing  of 
DNA  fragments  that  already  have  been  carefully  mapped  and  catalogued.  This  process  makes 
reassembling  the  sequenced  fragments  into  contiguous  sequence  easier  because  you  know  where 
the  fragment  came  from.  In  addition,  scientists  periodically  encounter  DNA  fragments  that  are 
particularly  difficult  to  sequence.  To  return  to  the  analogy,  it  is  much  easier,  takes  less  time,  and 
is  less  costly  to  assemble  the  text  in  "finished"  form  if  all  the  fragments  are  known  to  have  come 
from  the  same  chapter. 
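One way to see why knowing the "chapter" helps is to count how many fragment pairs might have to be compared. The sketch below is a rough illustration under stated assumptions: the fragment and bin counts are hypothetical, fragments are assumed to be spread evenly across bins, and every pair within a pool is assumed to be checked once. It does not model the other benefit named in the text, which is knowing where a difficult fragment came from.

```python
def pairwise_comparisons(n_fragments: int) -> int:
    """Number of unordered fragment pairs in one pool."""
    return n_fragments * (n_fragments - 1) // 2


def binned_comparisons(n_fragments: int, n_bins: int) -> int:
    """Pair count when fragments are split evenly across mapped bins
    and compared only with fragments from the same bin."""
    per_bin = n_fragments // n_bins
    return n_bins * pairwise_comparisons(per_bin)


TOTAL_FRAGMENTS = 1_000_000  # hypothetical figure, not from the statement
MAPPED_BINS = 1_000          # hypothetical number of mapped, catalogued pieces

pooled = pairwise_comparisons(TOTAL_FRAGMENTS)
binned = binned_comparisons(TOTAL_FRAGMENTS, MAPPED_BINS)
ratio = pooled // binned

print(f"one pool:    {pooled:,} candidate pairs")
print(f"mapped bins: {binned:,} candidate pairs")
print(f"about {ratio:,} times fewer pairs when fragments are binned")
```

The exact numbers are arbitrary; the point is that restricting comparisons to fragments from the same mapped piece shrinks the search roughly in proportion to the number of pieces.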

In  1996,  NHGRI  began  pilot  projects  to  test  strategies  and  technologies  for  full-scale 
sequencing of the human genome. We now have undertaken human sequencing in earnest. As a
result, investigators have deposited almost 150 million bases of "finished" high-quality human
DNA sequence in GenBank, the publicly funded database supported by the National Library of
Medicine. In accordance with the agreed-upon standards of the international genomic community,
all NIH-DOE funded sequencers have agreed to a rapid data release policy, such that new
sequence data is submitted to publicly accessible data banks within 24 hours. If one includes
"finished"  and  "close-to-finished"  sequence,  over  300  million  bases,  or  10  percent,  of  the  human 
DNA  sequence  has  been  deposited  in  GenBank. 

In order to meet the standards adopted by the international genomic community, the
sequence produced must have four characteristics, the "4 A's" of the Human Genome Project:

1) the sequence must be accurate, that is, the DNA spellings must be correct. The publicly
funded  genome  effort  will  ensure  accuracy  of  99.99  percent  or  better. 
2) the sequence must be assembled. Large-scale sequencing relies on the accurate
assembly  of  smaller  lengths  of  sequenced  DNA  into  longer,  genomic-scale  pieces,  so  DNA 
will  be  assembled  into  long  pieces  that  reflect  the  original  genomic  DNA. 

3)  Because  human  DNA  sequence  must  also  be  affordable,  a  portion  of  our  research 
funds  focuses  on  technology  development  to  reduce  the  cost  as  much  as  possible. 

4)  Finally,  high-quality,  finished  human  DNA  sequence  must  be  accessible.  In  order  to  be 
useful,  sequence  data  needs  to  be  rapidly  available  to  the  entire  research  community. 
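The accuracy and deposit figures in this section can be checked with exact arithmetic. The sketch below uses only numbers quoted in the statement (3 billion bases, 99.99 percent accuracy, 300 million bases deposited); the variable names are hypothetical, and treating 99.99 percent as a uniform per-base rate across the whole genome is a simplifying assumption.

```python
from fractions import Fraction

GENOME_BASES = 3_000_000_000            # "3 billion bases"
ACCURACY = Fraction(9999, 10000)        # "99.99 percent or better"
DEPOSITED_BASES = 300_000_000           # finished plus close-to-finished sequence

# Assumption: the accuracy floor applies uniformly to every base.
error_rate = 1 - ACCURACY
bases_per_error = int(1 / error_rate)
max_errors = int(GENOME_BASES * error_rate)
deposited_fraction = Fraction(DEPOSITED_BASES, GENOME_BASES)

print(f"at most one error per {bases_per_error:,} bases")
print(f"at most {max_errors:,} errors across the genome")
print(f"deposited so far: {float(deposited_fraction):.0%}")
```

Read this way, the 99.99 percent standard allows no more than one misspelled letter in every 10,000, and 300 million deposited bases is the 10 percent of the genome cited above.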

Research  Planning 

Informed  by  a  series  of  workshops  over  the  past  year  that  reviewed  research  progress  and 
identified  genome  research  opportunities,  Human  Genome  Project  leaders  recently  met  with  more 
than  100  representatives  from  a  range  of  scientific  disciplines  to  develop  the  next  5-year  plan, 
scheduled  to  begin  in  the  fall  of  1998.  With  both  the  physical  and  genetic  maps  complete,  and 
human  DNA  sequencing  pilot  projects  underway,  goals  of  the  1998-2003  draft  plan  considered  at 
that  meeting  focused  on: 

► completing a full, highly accurate and contiguous human genome DNA sequence;

► further development of technologies for steadily increasing sequencing capacity and
reducing costs;

► studies of variations in human DNA;

► studies of how large sets of genes function;

► studies of the similarities and differences between the human genome and those of
important laboratory animals;

► improved computer methods for data management; and

► studies regarding the ethical, legal and social implications of the HGP.

Private  Sector  Developments 

Just prior to the HGP planning meeting, industry researchers from The Institute for
Genomic Research (TIGR) and Perkin-Elmer, Inc. announced a plan to apply a DNA sequencing
strategy  they  had  used  on  micro-organisms  to  produce  a  "rough  draft"  of  the  human  genome 
sequence.  The  sequencing  strategy  recently  proposed  by  Perkin-Elmer,  Inc.  and  TIGR  differs 
from  the  public  effort  in  two  significant  ways:  quality  and  access. 

First,  that  strategy,  called  "whole-genome  shotgun  sequencing",  employs  fragments  that 
have  not  been  previously  mapped  or  catalogued  prior  to  sequencing.  Because  scientists  will  not 
know  where  in  the  long  chain  of  3  billion  base  pairs  the  fragment  might  belong,  the  task  of 
reassembling  the  fragments  becomes  far  more  difficult.  This  difficulty  in  reassembly  inevitably 
will  lead  to  gaps  and  misassemblies  in  the  sequence.  Some  of  these  may  occur  in  DNA  regions 
with  great  biological  significance.  The  private  sector  approach  does  not  propose  to  fill  in  all  the 
gaps  left  by  these  unsequenced  fragments,  thereby  creating  a  product  that  will  be  incomplete  for 
many research uses.

Secondly,  release  of  sequence  data  from  the  Perkin-Elmer-TIGR  effort  will  occur 
quarterly, rather than daily. The policy of daily release of DNA sequence data by publicly-funded
efforts  was  arrived  at  because  of  the  great  interest  in  the  scientific  community  in  gaining  access  to 
this  highly  valuable  information.  Any  delay  can  result  in  wasted  effort  in  research. 

Deliberations  on  Five-Year  Research  Plan 

Because  the  industry  plan  seemed  to  parallel  some  aspects  of  the  federal  Human  Genome 
Project,  planners  and  advisors  to  the  NIH-DOE  program  have  been  debating  extensively  how  the 
two  proposals  could  be  matched  up.  The  scientists,  at  the  recent  planning  meeting  on  the  draft 
HGP 5-Year Plan, concluded that while the two projects should complement one another, the
federal  project  should  continue  its  plans  to  provide  high-quality  human  DNA  sequence  as  soon  as 
possible  and  that  all  data  should  be  freely  accessible. 

Those  conclusions  rested  on  a  few  key  factors: 

►  The  industry  effort  may  not  deliver  the  product  in  the  time  and  manner  proposed.  The 
industry  approach  to  sequencing  has  not  been  tried  on  large  and  complex  genomes,  such  as 
the human, and depends on newly developed and unproven machines. Data to evaluate the
"whole genome" shotgun approach will initially come from a trial project on the fruit fly,
Drosophila,  but  is  not  expected  on  the  human  for  at  least  12  to  18  months; 

►  The  industry  plan  will  produce  a  large  amount  of  highly  useful  sequence  data,  but  this  plan 
will  yield  a  qualitatively  different  product  that  will  likely  contain  tens  of  thousands  of 
gaps; 

►  The  industry  plan  calls  for  release  of  sequence  data  on  a  quarterly  basis,  and  patenting  of 
100-300  "gene  systems."  While  quarterly  data  release  is  commendable,  the  plan  is  not  as 
strong as the standards established by the international sequencing community which
require release of data within 24 hours and discourage patenting. Further, some concerns
were expressed that the private effort's commitment to data release might diminish over
time, if business pressures came to the forefront.

In  view  of  those  concerns,  advisors  at  the  planning  meeting  enthusiastically  made  several 
unanimous  recommendations: 

►  The  publicly  funded  genome  project  should  continue  with  plans  to  provide  a  complete, 
high-quality  human  DNA  sequence  by  the  year  2005,  and  sooner  if  at  all  possible; 

►  All  possible  steps  must  be  taken  to  ensure  that  all  sequence  data  remain  in  the  public 
domain; 
► The publicly funded effort should take advantage of technology advances to increase
sequencing  capacity  as  much  as  possible  as  soon  as  possible  to  meet  research  needs,  both 
for  sequencing  of  the  human  and  model  organisms;  and 

►  The  sequencing  of  DNA  regions  of  high  utility  and  research  interest  should  be  emphasized. 

Now,  Human  Genome  Project  leaders  at  the  NIH  and  DOE  are  considering  that  advice  as 
they  put  the  final  touches  on  the  new  research  plan,  which  will  be  published  in  the  fall  of  1998. 
The  complete  plan  will  contain  details  for  all  of  the  Human  Genome  Project's  goals,  including 
sequencing, gene function, human variation, technology development, and Ethical, Legal, and
Social Implications.

The  private  and  public  genome  sequencing  efforts  should  not  be  seen  as  engaged  in  a 
race.  In  fact,  scientists  at  TIGR  and  Perkin-Elmer  have  expressed  their  enthusiasm  for  a  continued 
vigorous public effort on the HGP, and have conveyed their willingness to collaborate with NIH
and  DOE  on  the  production  of  the  complete  human  sequence.  The  NIH  and  DOE  welcome  this 
collaborative  approach,  as  the  whole  should  be  greater  than  the  sum  of  the  parts. 

Conclusion

Mr.  Chairman,  I  commend  you,  and  the  Members  of  this  Subcommittee,  for  convening  this 
hearing  today.  The  impact  on  the  future  of  biology  of  knowing  the  order  of  all  3  billion  human 
DNA  bases  has  been  compared  to  Mendeleev's  establishment  of  the  Periodic  Table  of  the 
Elements  in  the  19th  century  and  the  advances  in  chemistry  that  followed.  The  complete  set  of 
human genes, the biologic periodic table, will make it possible to begin to understand how they
function  and  interact.  Rapidly  evolving  technologies,  comparable  to  those  used  in  the  semi- 
conductor industry,  will  allow  scientists  to  build  detectors  that  analyze  tens  of  thousands  of  genes 
in  a  single  experiment.  Scientists  will  use  the  powerful  new  tools  to  reveal  the  secrets  of  disease 
susceptibility. This knowledge will in turn allow researchers to create broad new opportunities for
preventive  medicine,  lay  the  foundation  needed  to  develop  and  better  target  effective  therapeutics, 
and  provide  unprecedented  information  about  the  origin  and  migration  of  human  populations. 

The  investment  of  substantial  funds  by  the  private  sector  in  human  sequencing  reaffirms 
the  enormous  value  of  Human  Genome  Project  products  and  is  a  testament  to  the  success  and 
value  of  the  tools  already  developed  by  the  publicly  supported  project.  For  the  reasons  outlined 
above, it is not yet known what role this new endeavor will play over the long term in providing
the publicly available, detailed "A-to-Z" instruction book ultimately promised by the Human
Genome Project. Project leaders at the National Institutes of Health and the Department of Energy
look forward to close cooperation with Perkin-Elmer and TIGR as the new initiative unfolds over
the  next  few  years. 

This  concludes  my  remarks.  I  would  be  pleased  to  answer  any  questions. 




Francis S. Collins, M.D., Ph.D.

Dr. Francis Collins was appointed Director of the National Human Genome
Research Institute in April 1993. NHGRI oversees the role of the National Institutes of Health in the U.S. Human
Genome Project.

Dr. Collins pioneered the development of a powerful gene-finding method known as "positional cloning," which
utilizes the inheritance pattern of a disease within families to pinpoint the location of the gene associated with the
disease. Positional cloning is now commonly used to isolate genes even when no information about the gene's
function or biochemistry is known. Dr. Collins is perhaps best known for using positional cloning techniques to
isolate the genes for cystic fibrosis, neurofibromatosis type 1, Huntington's disease, and ataxia telangiectasia.

He  was  formerly  a  Howard  Hughes  Medical  Institute  investigator  and  professor  in  the  Departments  of  Internal 
Medicine and Human Genetics at the University of Michigan School of Medicine in Ann Arbor. He was also
director  of  the  NCHGR-supported  human  genome  center  at  Michigan. 

Current active research projects in the Collins laboratory include the development of better methods for analyzing
mutations in disease genes, especially for the BRCA1 gene on chromosome 17. The laboratory is also involved in an
ambitious effort to map the major genes contributing to adult-onset diabetes, by carrying out extensive linkage
analysis on affected siblings, largely collected in Finland. Positional cloning of the genes for familial Mediterranean
fever and multiple endocrine neoplasia is also underway, in collaboration with other investigators.

Born in Staunton, Virginia, in 1950, Dr. Collins received his bachelor of science degree with highest honors from
the University of Virginia. He received both his M.S. and Ph.D. degrees in physical chemistry from Yale University
and an M.D. degree from the University of North Carolina School of Medicine. He completed his internship and
residency in internal medicine at the North Carolina Memorial Hospital. From 1981 to 1984, he was a fellow in
human genetics and pediatrics at Yale. He joined the Departments of Internal Medicine and Human Genetics at
Michigan in 1984, becoming professor in 1991. He became a Howard Hughes Medical Institute assistant
investigator in 1987 and full investigator in 1991. Collins is a diplomate of the American Board of Internal
Medicine, the American Board of Medical Genetics, and the American College of Medical Genetics.

Dr.  Collins  was  elected  to  the  Institute  of  Medicine  in  1991  and  the  National  Academy  of  Sciences  in  1993.  He  is 
also a member of the American Federation for Medical Research, the American Society for Clinical Investigation,
the Association of American Physicians, and the international Human Genome Organization. He serves as an
associate editor for several publications, including Genomics; Genes, Chromosomes and Cancer; Human Molecular
Genetics; Somatic Cell and Molecular Genetics; and Human Mutation.

Among his most recent awards and honors, Dr. Collins has received the Gairdner Foundation International Award,
the Young Investigator Award of the American Federation for Clinical Research, the Doris Tulcin Award for Cystic
Fibrosis Research, University of Michigan's Distinguished Faculty Achievement Award, the National Medical
Research Award, and the University of Pittsburgh Dickson Prize. He holds honorary degrees from several academic
institutions. 
Chairman Calvert. Thank you, Doctor.
Dr.  Venter. 

TESTIMONY  OF  J.  CRAIG  VENTER,  PRESIDENT  AND  DIRECTOR, 
THE  INSTITUTE  FOR  GENOMIC  RESEARCH,  ROCKVILLE,  MD 

Mr.  Venter.  Thank  you  very  much,  Mr.  Chairman.  I  appreciate 
the opportunity to testify before your Subcommittee about the impact
of our new developments on the federally-funded human genome
effort.  I  also  appreciate  the  comments  of  Dr.  Patrinos  and  Dr.  Col- 
lins. 

I'm  the  founder  and  President  of  The  Institute  for  Genomic  Re- 
search, often  known  as  TIGR,  in  Rockville,  Maryland,  and  I'm  the 
to-be  President  of  the  new  company  we're  forming,  I'm  a  co-founder 
of  that  company  along  with  Tony  White  and  Mike  Hunkapiller  of 
the Perkin-Elmer Corporation. Recent publicity about our new venture
to sequence the human genome in 3 years has led to speculation
that funding for the human genome effort should be reduced
or  eliminated.  Nothing  could  be  further  from  the  truth.  Upon  com- 
pletion of  today's  hearing,  I  hope  it's  clear  that  this  new  private 
venture, and the federally-funded project are, in fact, complementary
efforts that can work together to make unprecedented impact
on  improving  research  on  human  health. 

One  goal  of  our  new  to-be-named  company  is  to  sequence  the 
human  genome  over  3  years,  using  dramatic  new  technology  devel- 
oped by  Mike  Hunkapiller's  team  at  the  Perkin-Elmer  Corporation 
and strategies that have been developed by myself and my colleagues
at  The  Institute  for  Genomic  Research  for  sequencing  whole 
genomes.  I  agree  with  the  comments  of  Dr.  Collins  that  the  focus 
has  been  lost  in  the  purpose  of  obtaining  the  human  genome  se- 
quence. And  it  was  concentrating  on  what  was  perceived  to  be  an 
absolutely  monumental  task  of  obtaining  that  sequence,  due  to  the 
limits in technologies and procedures that we've had in the past.
Analogies  to  the  Manhattan  Project  and  Apollo  Project  are  often 
used.  Billions  of  dollars  from  the  U.S.  Government  and  Europe  and 
Japan,  decades  of  work  from  thousands  of  scientists  around  the 
world,  were  thought  to  be  required  to  obtain  that  sequence.  New 
technologies  and  strategies  now  change  and  replace  some  of  these 
assumptions.  The  human  genome  will  be  accurately  and  completely 
covered  in  one  facility  by  a  new  company  in  Rockville,  Maryland, 
with  a  few  hundred  workers  using  new  technology. 

Our effort has been described by some as a rough draft, or worse, of
the human genome, but I heard these comments before, in 1994, when
Nobel Laureate Ham Smith and I proposed the new strategy for
sequencing genomes. In fact, the first genome in history, which we
published in Science in 1995, was done with this approach. The genome
review panel involving NIH funding rejected our grant application,
calling the approach impossible and predicting that we'd have a large
number of noncloseable gaps and misassembled pieces of the genome and
that, at best, the sequence would be incomplete and full of holes.
They were clearly wrong.

TIGR  is  the  only  organization  in  the  world  to  have  completely 
sequenced  more  than  one  genome.  In  fact,  we've  completed  seven, 
including the first three, and those seven represent half of the entire
world's complement of completed genomes. All seven, plus five
more to be finished this year by us, were done by the whole genome
shotgun  approach.  Our  sequences  are  some  of  the  highest-quality 
sequences  ever  completed  and  published.  More  than  a  dozen  patho- 
gen genome  projects  are  now  under  way  at  TIGR,  including  the 
malaria  genome  with  funding  by  the  National  Institutes  of  Health. 
I  should  point  out  that  the  Department  of  Energy  using  slightly 
different  review  processes  funded  TIGR  to  sequence  two  out  of 
three  of  the  first  genomes  completed  in  history  and  that  funding 
was obtained prior to the completion of the Haemophilus influenzae
sequence  in  1995. 

The  DOE  has  also  funded  TIGR  to  sequence  more  than  a  dozen 
key  environmental  genomes,  using  the  whole  genome  shotgun 
method, and the Department of Energy has also funded the bacterial
artificial chromosome end-sequencing strategy that is providing the
scaffolding for assembling the entire human genome sequence. I'm here
to urge you not only to not cut the DOE or other
genome  budgets  because  of  our  announcement  and  effort,  but  to  ac- 
tually consider  increasing  it. 

Having  the  complete  genome  moves  forward  all  the  issues  associ- 
ated with  genomics.  The  sequence  is  the  beginning  of  the  genome 
project.  It  is  absolutely  not  the  end  of  anything,  except,  perhaps, 
the  end  of  ignorance.  A  private/public  partnership  will  not  only  en- 
sure completion  of  the  genome  sequence  sooner,  it  will  provide  the 
basis  for  beginning  the  key  aspects  of  the  genome  project,  for  exam- 
ple, understanding  what  the  sequence  means. 

Because our effort is substantially moving forward the timetable
for completing the genome sequence, the resources for understanding
the genomic code become even more important. With comparative
genomes, we've learned this from microbial genome sequences: having
one genome was fantastic; having two or three was phenomenal and
aided our understanding. That's the situation with
human  and  that's  part  of  the  existing  plan  to  do  the  mouse  and 
other  genomes.  We  need  those  genomes  to  understand  and  inter- 
pret the  human  genome.  By  working  together,  DOE,  NIH,  and 
other  public  and  private  institutions  can  help  meet  the  goal  of  hav- 
ing a  complete  map  and  sequence  of  the  human  genome  within 
three  years.  I  see  that  as  an  announcement  that  everybody  can  be 
proud  of. 

I hope that after this hearing you will view our announcement
and the federal program, for which you are responsible, not as an
either/or proposition, but instead will focus on how these two
activities, working in tandem, can ultimately improve our lives and
those  of  generations  to  come. 

This  concludes  my  remarks  and  I'm  pleased  to  answer  any  ques- 
tions you  may  have. 

[The prepared statement and attachments of Mr. Venter follow:]

Mr. Chairman, I appreciate the opportunity to testify today before your subcommittee
about the impact of private sector developments on the federally-funded Human Genome
Project. Recent publicity surrounding the intent announced by Perkin-Elmer and me to
sequence the human genome has led some to speculate that federal funding for the human
genome is no longer needed. Nothing could be further from the truth. The Human
Genome Project is truly a success that both the scientific community and the federal
government can look upon with pride, and one that will continue to generate important
information. I am pleased to be here to put in context the role that I have played up until
now, and the role that I hope to play in the future. I hope after today you will recognize
the success of the program that you have funded, and also recognize the vast potential to
improve human health that lies just around the corner by linking both the federally-
funded initiative and our new private sector venture.

I  am  J.  Craig  Venter,  President  and  Director  of  The  Institute  for  Genomic  Research 
(TIGR),  an  independent,  not-for-profit  research  institute  in  Rockville,  MD  that  I  founded 
in 1992 after leaving the National Institutes of Health (NIH). On May 11, The Perkin-
Elmer  Corporation,  the  largest  producer  of  DNA  sequencing  technologies  in  the  U.S.,  and 
I  announced  a  new  venture  to  create  a  company  that  will  sequence,  as  part  of  its  initial 
projects,  the  Drosophila  (fruit  fly)  genome  and  the  human  genome  within  the  next  three 
years.    These  two  sequencing  projects  will  be  undertaken  using  breakthrough  DNA 
sequencing  technology  developed  by  Perkin-Elmer,  and  a  DNA  sequencing  strategy  that 
was  pioneered  by  my  colleagues  and  me  at  TIGR,  known  as  the  whole-genome  shotgun 
sequencing  method. 

This  announcement  is  very  exciting  for  both  the  public  and  private  scientific  communities 
throughout  the  world,  but  it  is  of  particular  significance  to  the  United  States  because  it  is 
the validation of the scientific claims of the Human Genome Project, which was first
discussed over 14 years ago and has been funded for the last ten years by U.S. taxpayers.
However,  I  believe  that  in  order  for  me  to  explain  this  comment  and  adequately  answer 
the  question  that  is  the  reason  for  today's  hearing,  it  is  important  to  discuss  the  events  that 
made  our  announcement  possible. 

NIH,  ESTs,  AND  TIGR 

When I was at NIH, I was a Section Chief at the National Institute of Neurological
Disorders and Stroke (NINDS). My lab was involved in a large scale chromosome
sequencing  effort  to  discover  genes  associated  with  neurological  functioning  and  disease. 
During  this  research,  my  colleagues  and  I  developed  a  new  strategy  for  identifying  genes 
more  rapidly  and  at  much  less  expense  than  previously  had  been  possible.  Prior  to  the 
development  of  this  new  strategy  we  had  labored  for  many  years  using  "traditional" 
sequencing methods to identify a few genes. In my own case, I spent ten years on the
gene for the adrenalin receptor. With the new strategy we greatly exceeded the work of
many previous years of effort in just a few months. This new strategy, known as
Expressed Sequence Tags (ESTs), was published in the journal Science in June 1991
(Complementary DNA Sequencing: "Expressed Sequence Tags" and the Human Genome
Project. Science 252, 1651-1656 (1991)). At the time of this publication, fewer than
2,000 of the 60,000 to 80,000 human genes were known.

It  is  important  to  note  that  this  new  strategy  was  more  than  just  creative  thinking  on  the 
part  of  the  federally-funded  scientists  in  my  lab.  It  also  included  a  significant  role  played 
by  a  new  technology  company  with  which  we  had  begun  to  collaborate.  In  the  late 
1980's,  Applied  Biosystems  manufactured  a  new  DNA  sequencing  technology  that 
greatly  improved  the  speed  with  which  a  DNA  sequence  could  be  obtained.  My  NIH  lab 
entered  into  a  CRADA  with  this  firm  and  worked  with  them  to  improve  their  technology. 
In  fact,  this  was  the  first  CRADA  entered  into  by  NIH  with  a  commercial  organization. 

By linking my lab's new EST strategy with Applied Biosystems' sequencing technology,
it  became  possible  to  greatly  improve  the  speed  with  which  new  genes  and  DNA 
sequences  in  general  could  be  identified.  While  our  new  strategy  was  not  yet  widely 
accepted,  I  learned  that  orders  for  the  Applied  Biosystems  DNA  sequencers  that  we  used 
in  our  experiments  had  skyrocketed.  So  there  was  clearly  significant  movement  on  the 
part  of  both  academic  and  commercial  institutions  to  adopt  this  new  technique  detailed  in 
the  Science  publication. 

About a year earlier, Congress had provided the initial funding to the Department of
Energy  (DOE)  and  NIH  for  the  Human  Genome  Project  (HGP).  From  its  inception, 
major  technical  innovations  were  considered  essential  to  the  success  of  the  project  and 
our  new  strategy  was  a  significant  step  forward.  In  fact,  the  gene  discovery  phase  of  the 
project  could  be  shortened  to  almost  one-tenth  of  the  originally  anticipated  timeframe. 
However,  there  were  many  other  hurdles  to  clear. 

Obviously  with  this  exciting  new  strategy  I  was  eager  to  scale  up  our  research  program  at 
NIH  in  order  to  implement  a  successful,  large-scale  genome  sequencing  and  gene 
discovery  program.  However,  the  extramural  genome  community  did  not  want  genome 
funding  being  used  on  intramural  programs.  In  addition,  there  was  growing  controversy 
surrounding  the  issue  of  the  U.S.  government  patenting  ESTs  that  I  discovered.  I  was 
frustrated  that  I  would  be  unable  to  participate  in  the  revolution  in  biology  that  we  had 
helped  start.  I  did  not  want  to  leave  NIH,  but  after  much  soul-searching  I  felt  it  was  the 
most  appropriate  option. 

In  1992,  with  funding  from  the  venture  capital  community,  I  formed  TIGR  as  an 
independent,  not-for-profit  research  institute  to  implement  the  programs  that  I  had 
envisioned  for  my  lab  at  NIH.  In  short  order,  we  utilized  the  EST  strategy  to  identify 
more  than  half  of  the  genes  in  the  human  genome  and  published  this  information  in  the 
Human  Genome  Directory  in  the  journal  Nature  in  1995  (Initial  Assessment  of  Human 
Gene  Diversity  and  Expression  Patterns  Based  Upon  52  Million  Basepairs  of  cDNA 
Sequence.  Nature  377  suppl.,  3-174  (1995)).  Also  in  1995,  using  a  new  strategy  for  DNA 
sequencing  that  we  pioneered,  known  as  the  whole-genome  shotgun  approach,  TIGR 
published the first complete sequence of a self-replicating, living organism, Haemophilus
influenzae, a bacterium that causes ear infections in children (Whole-Genome Random
Sequencing and Assembly of Haemophilus influenzae Rd. Science 269, 496-512 (1995)).

In  the  time  since  then,  TIGR  has  become  one  of  the  leading  genomics  institutions  in  the 
world  by  determining  the  complete  DNA  sequence  for  six  other  organisms.  Most 
recently, we published the sequences for the pathogen that causes Lyme disease, Borrelia
burgdorferi, and the bacterium that causes stomach ulcers, Helicobacter pylori (Genome
Sequence of the Lyme Disease Spirochaete, Borrelia burgdorferi. Nature 390, 580-586
(1997); The Complete Genome Sequence of the Gastric Pathogen Helicobacter pylori.
Nature 388, 539-547 (1997)). We have also published the DNA sequence for
Methanococcus jannaschii, the first archaeal genome to be sequenced, funded by the
Department of Energy (DOE), and we will soon be publishing the third DOE-funded
genome, Deinococcus radiodurans (The Complete Genome Sequence of the
Methanogenic Archaeon, Methanococcus jannaschii. Science 273, 1058-1073 (1996)). No
other institution in the world has completed more than one genome.

TIGR  has  also  been  funded  to  sequence  human  chromosome  16  by  the  NIH  as  one  of  the 
genome sequencing centers funded through the National Human Genome Research
Institute (NHGRI). In support of this effort, DOE has funded TIGR to generate sequence
from the ends of 600,000 BACs (bacterial artificial chromosomes) that will form a
scaffold  linking  the  human  genome  sequence  together. 

PE APPLIED BIOSYSTEMS

During  this  same  timeframe  Applied  Biosystems  had  grown  as  well.  The  continued 
expansion  of  the  Human  Genome  Project,  and  the  use  of  genomics  for  research  in  other 
areas  of  biology  created  huge  demand  for  DNA  sequencers.  Between  1987  and  1997, 
more  than  6,000  ABI  sequencing  systems  had  been  sold,  giving  them  the  largest  installed 
base  of  automated  sequencers  in  the  world. 

In  1993,  Perkin-Elmer,  a  U.S.-based  scientific  instrument  manufacturer,  acquired  Applied 
Biosystems  and  renamed  it  PE  Applied  Biosystems.  Perkin-Elmer  made  a  significant 
investment  in  the  life  sciences  with  its  acquisition  of  Applied  Biosystems  and  it  has 
continued  to  enhance  this  investment  by,  for  example,  investing  over  $100  million  in  the 
last  year  for  research  and  development  to  ensure  that  it  continues  to  develop  new,  cutting 
edge technologies. It is one of these new technologies, the ABI Prism 3700, that will
be  used  for  this  new  venture. 

THE  HUMAN  GENOME  PROJECT  AND  DNA  SEQUENCING 

As I'm sure you are all aware, the Human Genome Project has continued to be funded
through  the  DOE  and  NIH  and  is  now  entering  its  ninth  year.  This  project  was  officially 
launched  in  1990  as  a  $3  billion,  15-year  federal  initiative  to  map  and  sequence  the 
complete  set  of  human  chromosomes  and  those  of  several  model  organisms.  This  project 
was  a  huge  boost  to  the  scientific  community  and  represents  a  project  that,  when 
completed,  could  have  much  greater  significance  to  our  society  than  landing  on  the  moon. 
As a result of this commitment made by the U.S. government, our biotechnology
industry, which is holding its annual meeting in New York City this week, leads the
world in the science it undertakes, the jobs it creates, and the products it delivers to
improve human health.

Last  month  a  working  group  completed  a  review  of  the  draft  for  the  next  five-year  plan  of 
the  Human  Genome  Project.  The  program  continues  to  move  forward  and  has  made  great 
strides. When it was conceived, very few other organizations, either public or private,
recognized  the  value  that  this  activity  would  have  in  the  scientific  and  broader 
communities.  Now,  largely  through  the  success  of  this  relatively  small  federal  program, 
whole pharmaceutical companies are restructuring their drug discovery and development
process  based  on  genomics. 

Unfortunately,  when  the  Human  Genome  Project  was  initially  explained  to  the  Congress 
and  other  organizations  a  misunderstanding  occurred,  and  the  NIH  Director,  Dr.  Harold 
Varmus,  pointed  this  out  at  the  press  briefing  we  held  last  month  to  announce  our  new 
venture.  The  scientists  who  helped  organize  this  program  indicated  that  sequencing  the 
human  genome  was  the  key  to  improving  our  knowledge  of  human  biology.  This 
statement  has  led  many  to  believe  that  obtaining  the  complete  human  DNA  sequence 
would  mark  the  end  of  the  project.    In  fact,  the  acquisition  of  the  sequence  is  only  the 
beginning.  The  sequence  information  provides  a  starting  point  from  which  the  real 
research  into  the  thousands  of  diseases  that  have  a  genetic  basis  can  begin.  So,  the  sooner 
we  can  get  to  this  starting  point,  the  sooner  we  can  begin  to  see  a  payoff  in  ultimately 
improving  human  health. 

THE  NEW  VENTURE  AND  ITS  GOALS 

As  I  earlier  indicated,  our  announcement  last  month  to  sequence  the  human  genome 
within  the  next  three  years  has  been  widely  reported  in  both  the  scientific  and  popular 
press.  Like  the  federally-funded  project,  it  captures  the  imagination.  Like  the  federally- 
funded  project,  our  goal  is  not  to  obtain  the  sequence  for  its  own  sake,  but  to  obtain  it  to 
serve  as  a  foundation  of  data  upon  which  new  research  into  human  health  can  be  built. 
The  goal  is  to  develop  the  definitive  resource  of  genomic  and  associated  medical 
information  that  will  be  used  by  scientists,  in  both  the  public  and  private  sectors,  to 
develop  a  better  understanding  of  the  biological  processes  in  humans  and  to  deliver 
improved  health  care  in  the  future. 

In  addition,  this  new  company  intends  to  build  the  scientific  expertise  and  informatics 
tools  necessary  to  extract  valuable  biological  knowledge  from  this  data.  This  will  include 
discovering  new  genes,  developing  polymorphism  assay  systems,  and  developing  a 
variety  of  databases. 

There  is  value  in  obtaining  the  sequence  of  the  human  genome  as  quickly  as  possible—not 
for  the  sequences  themselves,  but  for  the  new  research  opportunities  it  will  create.  There 
is  a  significant  infrastructure  already  in  place  in  public  sector  research  institutions  that 
will  greatly  benefit  from  this  data.  Meanwhile,  the  pharmaceutical  and  biotechnology 
industries recognize that the human genome will be the significant resource for future
drug  discovery  and  development.  Most  important,  we  believe  that  access  to  this 
information  is  valuable  because  it  will  ultimately  transform  the  fundamentals  of 
healthcare  delivery  and  medical  practice  and  improve  the  lives  of  millions  of  people. 

The  development  of  a  new,  fully-automated  sequencer  by  Perkin-Elmer,  coupled  with  the 
whole-genome  shotgun  strategy  will  reduce  the  costs  of  operating  labor  and  reagents, 
while  it  increases  the  speed  with  which  sequences  can  be  generated.  By  building  on  the 
resources  that  have  already  been  developed,  such  as  the  significant  resource  funded  by  the 
DOE to sequence the ends of BACs, we have a framework for linking the human genome
together, a mechanism for verifying the alignments of sequences on individual
chromosomes  and  internal  controls  for  ensuring  the  quality  of  the  information  that  this 
venture  will  generate. 

The  aim  of  our  project  is  to  produce  a  highly  accurate,  ordered  sequence  that  spans  more 
than  99.9%  of  the  human  genome.  The  accuracy  of  this  sequence  will  be  comparable  to 
the  standard  now  used  in  the  genome  sequencing  community  of  fewer  than  one  error  in 
10,000  base  pairs.  We  look  forward  to  working  with  other  genome  centers  to  ensure  that 
the  sequence  meets  the  requirements  of  the  scientific  community  for  accuracy  and 
completeness. 

DATA  AVAILABILITY  AND  INTELLECTUAL  PROPERTY 

A  fact  that  has  often  been  overlooked  or  questioned  in  the  press  accounts  of  this  venture 
is  that  an  essential  feature  of  the  new  company's  business  plan  is  to  provide  public 
availability  of  the  sequence  data.  A  major  consequence  of  the  analysis  of  data  generated 
by  this  project  will  be  the  creation  of  a  comprehensive  human  genomic  database. 
Because  of  the  importance  of  this  information  to  the  entire  biomedical  research 
community,  key  elements  of  this  database,  including  primary  sequence  data,  will  be  made 
available.  In  this  regard  we  will  work  closely  with  national  DNA  repositories  like  the 
National  Center  for  Biotechnology  Information. 

It is our plan to release data into the public domain at least every 3 months, including the
complete human genome sequence at the end of the project. We also anticipate charging
a connect fee for online access to these data and many of the informatics tools that
researchers can use to interpret them. We will also market the database system to
commercial companies engaged in pharmaceutical and biotechnology research.

A  concern  that  has  been  raised  in  many  publications  is  how  the  intellectual  property 
issues  associated  with  generating  the  entire  human  genome  sequence  will  be  handled. 
First,  let  me  just  say  that  I  have  been  associated  with  intellectual  property  issues  related  to 
DNA  sequences  from  the  beginning  and  have  great  appreciation  for  the  sensitivities  of 
this concept. Making the sequence of the entire human genome available makes it
virtually impossible for any single organization to own its entire intellectual property. It
eliminates the speculative nature that is currently associated with patenting DNA
sequence information and requires that researchers understand the biology of a sequence
before they file a patent application.

Our actions will make the human genome unpatentable. We expect that this primary
data  will  be  used  by  us  and  others  as  a  starting  point  for  additional  biological  studies  that 
could  identify  and  define  new  pharmaceutical  and  diagnostic  targets.  Once  we  have  fully 
characterized  important  structures  (including,  for  example,  defining  biological  function), 
we  expect  to  seek  patent  protection  as  appropriate.  Given  the  complexity  and  scope  of 
the  information  found  in  the  human  genome  sequence,  we  expect  our  efforts  to  be 
focused  on  100  to  300  targets  from  among  the  thousands  of  potential  targets. 

CAN  THE  HUMAN  GENOME  BE  SEQUENCED  IN  3  YEARS? 

Another  question  that  I  have  been  asked  frequently  is,  can  the  whole-genome  shotgun 
strategy  even  work  with  a  genome  the  size  of  the  human  genome?  It  is  our  hypothesis 
that this approach will be successful. In fact, we plan to test the effectiveness of this
strategy  by  collaborating  with  Gerald  Rubin  of  the  Howard  Hughes  Medical  Institute  and 
the  University  of  California  at  Berkeley  and  the  Berkeley  Drosophila  Genome  Project  to 
sequence  Drosophila,  another  large  and  complex  genome,  while  we  establish  the 
infrastructure  for  the  larger  human  effort.  In  addition,  this  genome  will  provide  us 
significant  insights  into  the  biology  of  another  model  organism. 

IMPACT  ON  THE  FEDERALLY-FUNDED  HGP 

Finally,  there  is  the  concern  that  has  brought  us  before  you  today.  How  will  this  new 
private  venture  impact  the  federally-funded  Human  Genome  Project?  It  is  our  sincere 
hope  that  this  program  complements  the  broader  scientific  efforts  to  define  and 
understand  the  information  contained  in  our  genome.  We  recognize  that  our  effort  would 
not  even  be  possible  if  not  for  the  efforts  of  those  in  academia  and  government  who 
conceived  and  initiated  the  Human  Genome  Project.  In  fact,  the  knowledge  gained  from 
this  effort  will  provide  the  key  to  deciphering  the  genetic  contribution  to  thousands  of 
human  conditions  and  substantiates  and  underscores  the  need  to  increase  the  government 
investment  in  further  understanding  of  the  human  genome. 

I  have  heard  from  different  sources  that  our  new  venture  indicates  that  the  federally- 
funded  program  has  been  a  waste  of  money.  I  cannot  state  emphatically  enough  that  our 
announcement  should  not  be  the  basis  for  this  claim.  Let  me  explain  this  by  way  of  an 
example.  Recently,  the  genome  of  yeast,  S.  cerevisiae,  was  completed.  This  genome  was 
begun  before  the  whole  genome  shotgun  strategy  was  developed  and  as  a  result  it  took 
many years to complete. Literally thousands of scientists worked on this project. Does
the fact that a faster way now exists to obtain the sequence of the organism they were
working on render their work meaningless? Likewise, this new technology and strategy we have
announced  would  have  allowed  us  to  sequence  the  first  genome,  H.  influenzae,  much 
more  quickly.  This  fact  does  not  diminish  the  importance  of  obtaining  the  sequence  of 
this  organism. 

By  increasing  the  speed  with  which  the  sequence  of  the  human  genome  will  be  obtained, 
we  have  not  brought  any  program  to  completion.  We  have  only  helped  get  everyone  to 
the starting line a little bit sooner. The real race is the one that confronts us each and
every day, and that is the one to develop treatments that will help end human suffering
brought on by the thousands of diseases that plague humanity.

The  impact  that  our  new  venture  will  have  on  the  federally-funded  Human  Genome 
Project  should  be  to  re-orient  it  sooner  to  move  beyond  DNA  sequencing  into  the 
research  that  will  help  us  better  understand  and  treat  these  diseases. 

It  is  not  appropriate  to  judge  the  relevance  of  the  Human  Genome  Project  on  the  basis  of 
our  announcement  in  a  retrospective  fashion.  Without  the  past  we  could  not  be  here 
today.  However,  it  is  appropriate  to  judge  the  program's  relevance  in  light  of  our 
announcement, and others that may come, by its ability to adapt and work with new
initiatives  rather  than  compete  against  them. 

In  effect,  this  new  venture  is  the  private  sector  recognition  of  the  importance  of  the 
Human  Genome  Project.  By  working  closely  together,  NIH,  DOE  and  other  public  and 
private  institutions  can  help  meet  the  goal  of  having  a  complete  map  and  sequence  of  the 
human  genome  sooner  than  anyone  ever  imagined. 

There  are  many  other  issues  that  completing  the  sequence  of  the  human  genome,  as  well 
as  other  genomes,  will  raise  in  the  very  near  future.  This  increased  knowledge  of 
evolution,  and  ultimately  ourselves,  will  likely  prompt  many  questions  that  society  has 
never  even  considered.  If  anything,  this  new  information  will  require  us  to  strengthen  our 
scientific  infrastructure  and  improve  scientific  education.  We  must  work  to  ensure  that 
the  science  is  of  the  highest  quality,  appropriately  interpreted  and  peer  reviewed.  If  these 
areas  are  addressed,  I  believe  we  can  appropriately  assimilate  the  wealth  of  new 
knowledge  and  technology  that  genomics  will  provide. 

CONCLUSION 

As I said at the outset, I see the announcement of this new venture as one of which
everyone can be proud. It includes the federal government taking the initiative to begin a
significant program which is then made more successful by individual creativity and
ingenuity, and ultimately is validated by support from the private sector. I hope that after
this hearing you view both our announcement and the federal program for which you are
responsible not as an "either/or" proposition, but instead focus on how these two
activities working in tandem can ultimately improve our lives and those of the
generations to come. Thank you.

9712 Medical Center Drive

Rockville,  MD  20850 

301-838-3500 

J.  Craig  Venter,  Ph.D.,  is  the  Founder,  President  and  Director  of  The  Institute  for  Genomic  Research 
(TIGR),  a  not-for-profit,  tax  exempt  basic  research  institute  in  Rockville,  Maryland.   Between  1984  and 
the  formation  of  TIGR  in  1992,  Dr.  Venter  was  a  Section  Chief,  and  a  Lab  Chief,  in  the  National  Institute 
of  Neurological  Disorders  and  Stroke  at  the  National  Institutes  of  Health  (NIH).    In  1990,  Dr.  Venter 
developed a new strategy for gene discovery. This strategy, called expressed sequence tags (ESTs), has
revolutionized the biological sciences. Over 72% of all accessions in the public database GenBank are
ESTs from a wide range of species including humans, plants and microbes. Using the EST method, Dr.
Venter and the scientists at TIGR have discovered and published over one half of all human genes. Out
of new algorithms developed to deal with hundreds of thousands of sequences, TIGR developed the whole genome
shotgun method that led to TIGR completing the first three genomes in history.

Dr.  Venter  recently  announced  that  he  signed  a  letter  of  intent  with  Perkin-Elmer  for  the  formation  of  a 
new  genomics  company.   The  strategy  of  this  company  will  be  centered  on  a  plan  to  substantially 
complete  the  sequencing  of  the  human  genome  in  three  years. 

Dr.  Venter  has  published  more  than  150  research  articles  and  is  currently  tied  with  Dr.  Adams  of  TIGR 
as  the  most  cited  scientist  in  biology  and  medicine.   Dr.  Venter  has  received  numerous  awards  and 
honorary  degrees  for  his  pioneering  work  and  has  been  elected  a  Fellow  of  the  American  Association 
for  Microbiology  and  the  AAAS.    Dr.  Venter  received  his  Ph.D.  in  Physiology  and  Pharmacology  from 
the  University  of  California,  San  Diego  in  1975. 

Scientific  papers  published  include: 

Complementary  DNA  Sequencing:    "Expressed  Sequence  Tags"  and  the  Human  Genome  Project.   Science  252, 
1651-1656  (1991) 

Potential  Virulence  Determinants  in  Terminal  Regions  of  Variola  Smallpox  Virus  Genome.   Nature 
366,  748-751  (1993) 

Whole-Genome  Random  Sequencing  and  Assembly  of  Haemophilus  influenzae  Rd.  Science   269, 
496-512  (1995) 

Initial  Assessment  of  Human  Gene  Diversity  and  Expression  Patterns  Based  Upon  52  Million 
Basepairs  of  cDNA  Sequence.  Nature  377  suppl,  3-174   (1995) 

The  Minimal  Gene  Complement  of  Mycoplasma  genitalium.  Science  270,  397-403  (1995) 

Complete  Genome  Sequence  of  the  Methanogenic  Archaeon,  Methanococcus  jannaschii.  Science  273,
1058-1073  (1996) 

The  Complete  Genome  Sequence  of  the  Gastric  Pathogen  Helicobacter  pylori.    Nature  388,  539-547  (1997) 

The  Complete  Genome  Sequence  of  the  Hyperthermophilic,  Sulphate-Reducing  Archaeon  Archaeoglobus 
fulgidus.   Nature  390,  364-370  (1997) 

Genome  Sequence  of  the  Lyme  Disease  Spirochaete,  Borrelia  burgdorferi.  Nature  390,  580-586  (1997)

Complete  Genome  Sequence  of  Treponema  pallidum,  the  Syphilis  Spirochete,  Science  (submitted). 

5/98 


[Attachment:  The  Institute  for  Genomic  Research,  Grants/Proposals  Funded  and  Submitted.  The  table  lists  the  source  and  title,  period  of  project,  grant  number,  and  total  for  each  award;  the  dates,  grant  numbers,  and  amounts  are  not  reliably  legible.  Legible  entries  include:]

Department  of  Energy  (Adams),  "Expressed  Sequence  Tags  from  Human  Brain  for  Genome  Mapping"
Department  of  Energy,  "Identification  of  Genes  in  Anonymous  DNAs"
AmFAR,  "Examination  of  Cellular-Viral  Protein  Associations"
Johns  Hopkins  University  (Fraser),  "SPORE  in  Gastrointestinal  Cancer"
National  Science  Foundation  (Fields),  "Integration  of  Molecular  Sequence  with  Specimen  and  Taxonomic  Data  in  a  Publicly-Accessible  Database"
Department  of  Energy  (Venter),  "High-Throughput  DNA  Sequencing  and  Characterization  of  Diverse  Microbial  Genomes"  (with  a  later  supplement)
Department  of  Energy  (Venter),  "Mycoplasma  genitalium"
Department  of  Energy  (Venter),  "Whole  Genome  Sequencing  of  Deinococcus  radiodurans"  (supplement)
SmithKline  Beecham  Pharmaceuticals  (Venter),  "Alzheimer's  Disease  Research"
DHHS/NIH,  "Characterization  of  a  novel  GABA-A  receptor  subunit"
"Genetic  Organization  of  Treponema  pallidum"
SUNY,  "Physical  Mapping  of  Schistosoma  mansoni  Chromosomes"  (subcontract)
DHHS/NIH  (Adams),  "Sequencing  of  Chromosome  16p"  (with  supplements)
DHHS/NIH  (Fleischmann),  "Complete  Genome  Sequence  of  Mycobacterium  tuberculosis"
NSF  (Venter)  and  DOE  (Venter),  "Arabidopsis  Genome  Sequencing  Using  Random  Shotgun  Sequencing  of  BAC  Clones"
Department  of  Energy  subcontract  (Adams),  "Construction  of  a  genome-wide,  highly  characterized  clone  resource  for  genome  sequencing"
DHHS/NIH  (Clayton),  "Whole  Genome  Sequencing  of  Vibrio  cholerae"
DHHS/NIH  (Ketchum),  "Complete  Genome  Sequence  of  Enterococcus  faecalis"
Department  of  Defense  (Venter),  "Malaria  Genome  Sequencing  Project"
"Computational  Techniques  for  Genomic  Analysis"
Merck  Genome  Research  Institute,  "Genome  Analysis  of  Staphylococcus  aureus"
Merck  Genome  Research  Institute  (Smith),  "Computational  Analysis  of  Intergenic  Regions  in  Microbial  Genomes"
DOE  (Venter),  "An  Integrated  Program  in  Microbial  Genome  Sequencing  and  Analysis"
DHHS/NIH  (Adams),  "African  Trypanosome  Genome  Sequencing"
DHHS/NIH,  "Genome  Analysis  of  Chlamydial  species"

Chairman  Calvert.  Dr.  Galas. 

TESTIMONY  OF  DAVID  J.  GALAS,  PRESIDENT  AND  CHIEF 
SCIENTIFIC  OFFICER,  CHIROSCIENCE  R&D  INC.,  BOTHELL,  WA 

Mr.  Galas.  Mr.  Chairman  and  Mr.  Roemer,  I  certainly  welcome
the  opportunity  to  testify  before  the  Committee  concerning  the  future
of  a  project  so  central  to  the  future  of  not  only  the  biological
sciences  but  the  biotechnology  and  health  care  industries  of  the
United  States,  and  it  is  a  pleasure  to  be  here  with  such  a  distinguished
group.

This  is,  as  is  evident,  a  critical  time  for  this  historic  project  and
the  attention  of  Congress,  the  private  sector,  and  the  public  sector,
and  all  of  the  scientific  community,  is  certainly  called  for  to  ensure
that  we  make  the  most  of  our  opportunity  here,  the  opportunity  to
advance  the  scientific  foundations  of  these  areas  that  are  so  important
to  the  health  of  the  nation.

Now  having  worked  in  academia,  as  well  as  the  private  sector,
I  have  witnessed  firsthand  the  effect  this  project  has  already  had  on  research
in  the  public  and  private  sectors  and  several  of  the  previous  witnesses
have  cited  these.  It's  become  a  cliche  to  call  these  effects
revolutionary  and  I'm  not  going  to  add  to  any  of  these  cliches,  but
let  me  just  point  out  that  in  this  case,  almost  all  of  these  cliches
have  been  quite  accurate.

So  why  is  the  Human  Genome  Project  so  important  and,  when
one  summarizes  this,  what  is  this  revolution  about?  Well,  I'd  say  it's
simply  about  scientists,  wherever  they  are  in  the  life  sciences,  having
the  fundamental  data  close  at  hand  about  the  information  in
the  human  genome,  the  genes  and  regulatory  elements,  so  that
they  can  enable  their  research  into  fundamental  disease  mechanisms,
diagnostics,  therapeutics,  and  other  fundamental  biological
mechanisms  to  an  extent  never  seen  before.

Now  this  genetic  information  is  particularly  important  to  the  private
sector  which  is  devoted  to  discovering  and  developing  new
therapeutic  drugs,  among  other  things.  A  great  deal  of  money  and
time  is  now  spent  in  publicly-supported  laboratories  and  in  private
companies  across  the  world  acquiring  genomic  information,
genomic  sequence  information  piecemeal  as  it  is  needed.  For  example,
the  availability  of  the  full  sequence  of  the  human  genome,  even
a  rough  version  thereof,  this  past  year  would  have  saved  our  small
biotechnology  company,  I  estimate,  about  $1.5  million  in  direct
costs  and  countless  months  of  time  on  each  of  several  projects.
Our  work  in  discovering  therapeutics  for  autoimmune  disease,
osteoporosis,  and  other  diseases  is  still  a  small  corner  of  the  biomedical
research  spectrum,  and  so  these  costs  to  us  need  to  be  multiplied
by  the  relative  size  and  number  of  all  involved  biotechnology
and  pharmaceutical  companies  in  this  country  to  see  what  the  direct
cost  impact  on  biomedical  research  would  be.  Now  the  indirect
costs  are  also  great,  as  will  be  the  impact  on  publicly  funded  research
of  all  kinds.  It  all  adds  up  to  a  very  large  potential  savings
and  some  very  rough  calculations  that  I  made  suggest  that,  perhaps,
a  year's  advance  in  the  availability  of  this  information,  say  in
the  next  year,  for  purposes  of  argument,  would  probably  save  something
like  $2  billion  in  funding  in  the  private  sector  and,  I  think,
that's  quite  a  conservative  estimate.

So  the  discovery  of  therapeutics,  of  course,  is  not  only  about
money.  The  savings  that  arise  from  better,  more  effective  therapies,
and  diagnostics  that  come  sooner  to  the  public,  and  I  emphasize
the  word  sooner,  must  also  be  a  major  consideration.  The  need  for
a  widely-available  public  data  resource  containing  the  full  complement
of  human  sequence  information  has  never  been  greater.
The  announcement  by  Dr.  Venter  and  his  colleagues  that  they  are
forming  this  new  enterprise  to  generate  vast  amounts  of  human  sequence
brings  us  here  today  and  this  project,  I'd  like  to  just  make
a  few  comments  on.  This  is  a  most  ambitious  project,  of  course,  requiring
a  large  number  of  new  things,  new  automated  machines,
new  computational  methods,  new  significant  data  production  organization,
but  a  relatively  small  group.  It's  a  difficult  undertaking,
but  as  you  see,  and  as  you  have  responded,  it  is  a  galvanizing
prospect  to  the  entire  community.

Now  while  I  cannot  directly  assess  the  new  technical  advances
that  are  cited  in  their  announcement,  to  me  the  claims  are  quite
credible  and  most  welcome  and,  judging  from  my  familiarity  with
the  field,  are  probably  within  reach.  The  scientists  involved  are  experienced,
serious,  and  careful  and  the  prospect  of  doing  what  is
planned  is  certainly  within  what  I  view  as  technically  feasible  and
certainly  not  fanciful.  While  there  will  always  be  debates  about
how  new  approaches  will  work  and  about  the  technical  details,  and
these  will  change,  there's  no  question,  from  month  to  month  as  we
go  forward,  I  would  say  in  summary  that  their  proposal  seems  to
be  well-founded  and  plausible.

Now,  obviously,  the  first  judgment  on  their  success  or  failure  is
going  to  depend  on  their  resolve,  their  resource  commitment,
and,  finally,  on  awaiting  real  results,  but  it  seems  to  me  they  have
an  excellent  chance  of  succeeding  and  achieving  their  most  important
goals.  So  it  is  notable  and  very  welcome  in  addition  that  the
community  effort  is  going  to  be  treated  to  the  availability  of  the
vast  amounts  of  this  information  as  the  project  goes  forward,  according
to  their  announcement.

In  reaction  to  that  I'd  say  it's  essential  that  the  community  and
the  leadership  of  the  genome  project  take  these  prospects  very  seriously
and  work  both  to  reform  or  restrategize  about  the  human  genome
project  strategy,  anticipating  access  to  this  new  data,  and  to
forge  close  links  to  the  private  sector;  both  sentiments  have  already
been  described  by  the  leadership  of  the  project.

So  let  me  just  say  in  emphasis,  I  do  not  believe  that  it  is  sensible,
however,  for  the  federally-supported  program  either  to  continue
absolutely  unchanged  with  the  strategy  currently  in  effect,
or  to  reduce  the  level  of  their  efforts.  Both  of  those  are  very  important
and  I  think  it's  clear  from  the  response  so  far  that  at  least
this  general  view  is  shared  by  both  the  DOE  and  the  NIH.  It  seems
that  the  prospect  of  the  private  sector  sequencing  effort  has  served
as  quite  a  useful  stimulus  to  refocusing  the  Federal  effort  or  at
least  having  a  look  at  the  strategy.  And  I'm  sure  Dr.  Olson  will
comment  on  some  of  these.  In  my  view,  changing  the  strategy
slightly  will  be  very  effective  and  now  let  me  explain  what  I  mean
by  that  in  just  a  few  words.

Initially,  what's  most  important  in  the  genome  is  the  location
and  structure  of  the  functional  components,  the  genes  and  the  control
elements.  Next  most  important  are  the  variations  that  occur  in
these  component  parts,  how  they  occur
in  the  human  population,  and  the  fundamental  biological  effect  on
the  individuals  that  carry  those  variations.

Now  it  is  going  to  be  the  research  work  of  many
decades  to  understand  the  basic  biological  and  health  effects  of
these  variations.  But  in  achieving  the  initial  goal,  the  first  of  these,
getting  the  fundamental  information  about  the
genes  and  their  control  elements,  I  would  argue  that  it
should  be  the  first  new  goal  of  the  human  genome  project  to  focus
its  attention  on  getting  the  first  characterization  of  the  genome  sequence
as  quickly  as  possible.  It's  been  characterized  as  a  first
draft,  which  may  be  considered  to  be  a  pejorative,  but  I  think  what
we  really  need  is  to  get  that  information  out  as  soon  as  possible
and  I  think  plans  are  under  way  that  could  well  put  this  together.

Now  reaching  this  goal  in  conjunction  with  the  private  effort 
would  enable  the  human  genome  project  to  succeed  more  rapidly 
than  ever,  but  I  think  even  without  that,  it's  the  right  thing  to  do, 
to  reorient  towards  getting  a  rapid  release  of  something  that  some
call  a  first  draft  or  an  intermediate  draft.  So  this  strategy,
I  think,  makes  a  great  deal  of  sense  and  let  me  just  summarize  the 
arguments  that  I'm  putting  forward  for  that. 

No.  1  is  speed.  Speed  is  absolutely  critical  to  the  private  sector 
and  the  public  sector.  The  second  one  is  that  it  is  a  major  benefit, 
every  piece  of  new  information  is  a  major  benefit  to  the  biomedical 
research  community.  Third,  an  effective  and  positive  response  to 
the  private  sector  proposal  is  also  gained  by  adopting  this  sort  of 
a  strategy.  And,  finally,  future  technical  effectiveness:  I  think  there
are  many  technical  aspects  of  the  revised  strategy  that  stand  to
provide  significant  advantages  for  future  sequencing  effort  once  the
details  are  worked  through  as  they  will  be  in  the  next  few  years.

Reaching  the  first  goal,  however,  should  be  seamless  with  a  follow-on
effort  to  completely  fill  in  the  sequence  draft,  if  you  will,  by
producing  a  very  accurate,  high  quality,  and  complete  reference  sequence
of  the  genome.  This  final  product  of  the  human  genome  program
will  then  become  the  single  most  important  database  of
human  biology,  the  complete  sequence  of  our  genetic  heritage.

Rather  than  being  redundant,  the  federal  program  is  more  relevant
than  ever,  since  federal  support  should  now  be  able  to
achieve  more  per  dollar  spent,  and  produce  a  product  quite  different
from  what  can  be  expected  from  the  private  effort,  if  the  private
effort  succeeds.  I  would  suggest  that  more  resources  should  be  devoted
to  the  sequencing  effort  now  because  the  project  offers  returns
soon  and  the  impact  of  early  acquisition  of  the  information
will  be  well  worth  it.

The  prospect  before  us  of  a  highly-cooperative  effort  between
public  and  private  sectors  is  one  that  I  think  we  should  seize  enthusiastically.
Now  the  federal  program  appears  to  be  already  responding
with  renewed  resolve  to  this  opportunity  by  rethinking
the  strategies  and  there's  been  a  lot  of  effort,  I  know,  expended  on
discussing  plans  for  sequencing  programs.  I  applaud  this  resolve
and  I  expect  the  genome  community  at  large,  both  public  and  private,
will  recognize  the  critical  nature  of  this  moment  and  seize  the
opportunity  to  make  the  most  of  it.

This  completes  my  prepared  remarks  and  I'd  be  happy  to  answer
any  questions.
[The  prepared  statement  and  attachments  of  Mr.  Galas  follow:]

Galas,  10  June  1998
Mr.  Chairman  and  Members  of  the  Committee: 

I  welcome  the  opportunity  to  testify  before  the  committee  concerning  the  future  of  a  project 
so  central  to  the  future  of  medicine,  the  biological  sciences  and  the  biotechnology  and 
health  care  industries  of  the  United  States,  the  Human  Genome  Project  ("HGP").  This  is  a 
critical  time  in  the  progress  of  this  historic  project  and  the  attention  of  the  Congress,  the
private  sector  and  all  the  scientific  community  is  called  for  to  ensure  that  we  make  the  most
of  this  opportunity  to  advance  the  fundamental  scientific  foundations  of  these  areas  so 
important  to  the  health  of  our  nation. 

I  will  present  here  my  views  on  the  strategic  issues  confronting  the  broader  community 
directly  concerned  with  the  project  and  explain  why  the  impact  on  the  public  and  private 
sectors  will  be  so  fundamental.  I  am  the  President  and  Chief  Scientific  Officer  of  a  small 
biotechnology  company  in  Seattle,  Washington.  Having  worked  in  academia,  as  well  as 
the  private  sector,  I  have  participated  in  the  revolutionary  changes  in  the  biomedical 
sciences  engendered  by  the  explosive  accumulation  of  genetic  data  and  of  DNA  sequence 
information,  and  have  witnessed,  first  hand,  the  effect  it  has  already  had  on  the  conduct  of
research  in  the  public  and  private  sectors.  It  has  become  almost  a  cliché  to  call  these
effects  revolutionary,  but  in  this  case  the  cliché  is  accurate.  I  have  served  in  government,
and  I  am  proud  to  have  been  in  the  position  of  responsibility  in  DOE  now  occupied  by  Dr.
Patrinos  at  the  official  launch  of  the  Human  Genome  Project  in  1990  by  DOE  and  NIH. 

Why  is  the  HGP  so  important  and  what  is  this  revolution  about?  It  is  simply  about 
scientists  and  researchers  having  close  at  hand  the  fundamental  data  about  the  layout  and 
information  content  of  all  the  human  genome,  genes  and  regulatory  elements.  This  enables 
research  into  fundamental  disease  mechanisms,  diagnostics  and  therapeutics  to  an  extent 
never  seen  before.  Therefore,  this  genetic  information  is  particularly  important  to  the
private  sector  devoted  to  discovering  and  developing  new  therapeutic  drugs.  A  great  deal 
of  money  and  time  is  now  spent  in  private  companies  across  the  world  acquiring  genomic 
information  piecemeal,  as  it  is  needed.  For  example,  the  availability  of  the  full  sequence  of
the  human  genome  this  past  year  would  have  saved  our  small  biotechnology  company  $1.5 
million  alone  in  research  costs  directly  expended  on  sequencing  new  regions  of  the  genome 
and  countless  months  of  time  on  each  of  several  projects.  Our  work  towards  discovering 
therapeutics  for  autoimmune  disease,  osteoporosis  and  other  diseases  is  still  a  small  corner
of  the  biomedical  research  spectrum.  These  costs  to  us  need  to  be  multiplied  by  the  relative 
size  and  number  of  all  the  involved  biotechnology  and  pharmaceutical  companies  in  this 
country  to  see  the  direct  cost  impact  on  biomedical  research  -  the  indirect  effects  will  also 
be  numerous  and  impressive.  It  adds  up  to  a  very  large  potential  savings,  and  all  of  these 
needs  will  continue  to  increase  as  research  advances.  In  addition,  the  biomedical  research 
funded  by  the  federal  government  will  also  be  enabled  and  accelerated  by  this  information. 
Therefore,  the  cost  savings  to  the  public  and  private  sectors,  in  time  and  money  alone,  will 
be  enormous.  However,  the  discovery  of  new  therapeutics  is  not  only  about  money. 
Savings  of  another  kind,  that  which  arises  from  better,  more  effective  therapies  and 
diagnostics  coming  sooner  to  the  public,  must  also  be  a  major  consideration.  The  need  for 
a  widely  available,  public  data  resource  containing  the  full  complement  of  human  sequence
information  has  never  been  greater. 

What  brings  us  here  today  is  the  announcement  by  Dr.  Venter  and  his  colleagues  (PE- 
TIGR)  that  they  are  forming  a  new  enterprise  to  generate  vast  amounts  of  sequence  data  on 
the  human  genome  in  a  few  short  years.  This  is  a  most  ambitious  project,  requiring  a  large 
number  of  new  automated  machines,  new  computational  methods,  a  significant  data 
production  organization  and  new  infrastructure.  It  is  a  galvanizing  prospect  to  the  entire 
community.  While  I  am  not  in  a  position  directly  to  assess  the  new  technical  advances  that 
are  cited  in  their  announcement,  the  claims  are  both  credible  in  detail  and  most  welcome  and,
judging  from  my  familiarity  with  the  field,  are  probably  well  within  reach.  The  scientists
involved  are  experienced,  serious  and  careful  and  the  prospect  of  doing  what  is  planned  is 
certainly  within  what  I  view  as  technically  feasible  and  certainly  not  fanciful.  While  there 
will  always  be  debates  about  whether  and  how  new  approaches  will  work  and  about  the 
technical  details,  their  proposal  appears  to  be  well  founded  and  plausible.  Final  judgment 
on  their  success  or  failure  will  depend  on  the  resolve  and  resource  commitment  of  the 
principals  and  must,  of  course,  await  the  first  real  results,  but  it  seems  likely  to  me  that  they 
stand  a  good  chance  of  succeeding  in  achieving  their  most  important  stated  goals.  It  is 
notable  and  very  welcome  to  the  entire  community  that  the  PE-TIGR  effort  has  made  a
commitment  to  sharing  sequence  data  with  the  public  HGP.

It  is  essential  that  the  community  and  the  leadership  of  the  genome  project  take  these 
prospects  very  seriously  and  work  both  to  reform  the  HGP's  strategy  anticipating  access  to 
this  new  data  and  to  forge  close  links  to  the  private  sector  effort.  As  I  will  argue  below,  I 
do  not  believe  that  it  is  sensible  for  the  federally  supported  project  either  to  continue 
unchanged  with  the  strategy  currently  in  effect,  or  to  reduce  the  level  of  their  efforts.  I 
think  it  is  clear  from  the  response  thus  far  that  this  general  view  is  shared  by  the  DOE  and 
NIH  alike.  They  appear  to  be  responding  with  an  eminently  sensible  attempt  at  revision  of 
the  strategy  for  sequencing  and  a  commitment  to  take  advantage  of  whatever  new 
sequencing  capacity  and  data  release  come  from  the  private  effort.  It  seems  that  the
prospect  of  the  private  sector  sequencing  effort  has  served  as  a  beneficial  stimulus  to 
refocus  the  federal  effort  on  a  strategy  that  will,  in  my  view,  maximize  the  effectiveness  of 
the  project  whether  or  not  the  private  effort  reaches  their  stated  goals.  If  they  do  reach 
these  goals  the  strategy  will  greatly  advance  the  rate  of  accumulation  of  useful  data  and 
hasten  the  day  of  the  first  completion  of  the  sequence  of  the  human  genome.

Initially,  what  is  most  important  in  the  genome  is  the  location  and  structure  of  the  functional
components  -  the  genes  and  their  control  elements.  Next  in  importance  are  the  variations  that
occur  in  these  component  parts  in  the  human  population  and  the  fundamental  biological 
effects  on  the  individuals  that  carry  these  variations.  It  is  these  variations  that  make  each 
of  us  distinct  in  our  good  health  and  strengths,  and  our  susceptibility  to  disease  and  ill 
health.  It  will  be  the  research  work  of  many  decades  to  understand  the  extent  and  the 
basic  biological  and  health  effects  of  these  variations  -  this  work  will  be  a  large  part  of  the 
future  of  medical  research. 

The  initial  goal  of  the  HGP  sequencing  effort  is  to  provide  the  initial  blueprint,  the  basic 
sequence,  not  the  myriad  of  sequence  variations.  While  many  basic  researchers  and 
companies  alike,  us  included,  are  focused  on  detecting  and  understanding  consequences  of 
these  many  small  variations  in  the  human  genome,  called  single  nucleotide  polymorphisms 
or  SNPs,  we  all  need  the  initial  sequence  to  progress  this  next  wave  of  biomedical
research.  Therefore,  I  argue  that  it  should  be  the  essential  primary  goal  of  the  HGP  to 
focus  its  attention  on  how  to  arrive  at  the  first  initial  characterization  of  the  genome 
sequence  as  quickly  as  possible,  whether  or  not  the  private  effort  contributes  in  the  long 
run  to  reaching  this  goal.  Reaching  this  goal  in  conjunction  with  the  private  effort, 
however,  would  enable  the  HGP  to  succeed  more  rapidly  than  ever,  but  even  without  the 
impetus  of  the  prospect  of  the  private  effort  the  HGP  should  be  re-oriented  to  this  primary 
goal  -  to  obtain  an  initial  "first  draft"  of  the  human  genome  as  soon  as  possible.  Even  a 
rough  "first  draft"  would  be  absolutely  invaluable  to  the  broad  biomedical  community.  It 
appears  that  the  prospect  that  brings  us  here  today  has  galvanized  the  HGP  into  considering 
a  strategy  like  this  in  any  case  and  one  that  could,  with  public-private  cooperation,  lead  to  a 
much  more  rapid  achievement  of  this  initial  goal.   This  strategy  makes  sense. 

To  summarize  the  arguments  for  a  refocused  HGP  strategy:

1.  Speed.  The  critical  information  will  be  available  sooner,  probably  95%  within  3
years.

2.  A  major  benefit  to  biomedical  research.  The  benefits  of  locating  genes  and  control
elements  sooner  will  substantially  advance  all  biomedical  research  sectors.

3.  An  effective  and  positive  response  to  the  PE-TIGR  proposal.  Refocus  of  the  HGP
strategy  takes  advantage  of  the  opportunity  to  leverage  the  private  sector  investment  into  a
valuable  public  resource.

4.  Future  technical  effectiveness.  There  are  many  technical  arguments  for  the  revised
strategy  that  stand  to  provide  advantages  for  future  sequencing  efforts  once  the  details  are
worked  through.

The  achievement  of  the  initial  goal  of  a  "first  draft"  should  in  no  way  mark  the  end  of  the 
project.  It  is  important  that  the  reaching  of  the  first  goal  be  seamless  with  a  continuing, 
follow-on  effort  to  complete  the  sequence  "draft"  by  producing  a  very  accurate,  high- 
quality,  complete  reference  sequence  of  the  genome.  Finishing  this  final  product  is  just  as 
important  as  the  initial  goal  and  will  be  easier  and  less  expensive  than  it  is  now.  This  final 
product  of  the  HGP  will  then  become  the  single  most  important  database  of  human 
biology,  the  complete  sequence  of  our  genetic  heritage. 

Rather  than  being  redundant,  the  federal  HGP  is  more  relevant  than  ever,  since  federal 
support  should  now  be  able  to  achieve  more  per  dollar  spent,  and  produce  a  product  quite 
different  from  what  can  be  expected  from  the  private  effort.  I  suggest  that  the  early 
prospect  of  completion  that  arises  from  the  private  proposal  should  be  met  with  increased 
funding  for  the  federal  project,  subject  to  successful  completion  of  the  new  planning  effort 
that  is  underway.  The  changes  should  not,  however,  end  there.  The  prospect  before  us  of 
a  strong,  highly  cooperative  effort  between  the  public  and  private  sectors  is  one  that  we 
should seize enthusiastically. Public-private sector cooperation too often is afflicted with
bureaucratic viscosity, management difficulties and basic problems in reaching the stated
goals. In my view, this opportunity appears to be one that will lend itself well to avoiding
these pitfalls. The benefits of a successful endeavor, to both sides and to the public at large,
are  indeed  great  and  the  commitments  and  progress  will  be  visible  and  accountable  in  large 
measure  by  both  sides. 

The  federal  program  appears  to  be  already  responding  with  renewed  resolve  to  this 
opportunity  by  rethinking  the  strategy  and  replanning  the  sequencing  programs  and  I  expect 
the  genome  community  at  large,  both  public  and  private,  will  recognize  the  critical  nature  of 
this  moment  and  seize  the  opportunity  to  make  the  most  of  it. 

This completes my prepared testimony. I would be happy to answer any questions.

Chairman Calvert. Thank you, Doctor.
Doctor  Olson. 

TESTIMONY  OF  MAYNARD  V.  OLSON,  PROFESSOR  OF  MEDICAL 
GENETICS  AND  GENETICS,  DEPARTMENT  OF  MOLECULAR 
BIOTECHNOLOGY, AND DIRECTOR, GENOME CENTER,
UNIVERSITY OF WASHINGTON, SEATTLE, WA

Mr.  Olson.  Thank  you,  Mr.  Chairman.  I'm  here  to  provide  the 
perspective  of  an  academic  researcher  who  has  been  involved  in 
what  is  now  called  genome  analysis  for  over  20  years.  Indeed,  my 
involvement  dates  to  a  time  when  the  term  genome  was  rarely 
used,  even  in  scientific  circles,  and  had  yet  to  have  any  impact 
whatsoever  on  public  discourse.  Since  then,  of  course,  the  times 
have  changed  as  this  hearing  and  the  intensive  press  coverage  of 
the Perkin-Elmer announcement indicate. They've changed, perhaps
foremost, because the singular historical opportunity that we
now face, to unravel the molecular details of how the information
is stored, and what the information is, that guides the
transformation of a fertilized egg into a fully-developed human being, has
caught both the popular and the scientific imagination.

More  practically,  and,  perhaps,  more  forcefully  in  the  short  run, 
times  have  changed  as  the  immediate  value  of  the  data  produced 
by  genome  analysis  has  become  evident,  particularly  the  value  of 
DNA sequence data. These data have a high scientific value and
also a high value in dollars, yen, and Euros, thus injecting a major
participation of the commercial sector into what had previously
been predominantly a basic science initiative.

Congress now faces a new challenge of understanding and
responding to a scientific environment in the human genome project
that has all of the chaos that comes with scientific and policy success.
My basic message in this turbulent environment is quite simple,
and that is that the system is working. It is important to keep in
mind  that  biomedical  research  in  the  United  States  derives  its  for- 
midable strength  from  the  synergy  between  three  sectors,  the  bio- 
technology industry,  the  more  traditional  pharmaceutical  industry, 
and  academic  and  publicly-supported  research.  All  of  these  sectors 
are  scrambling  in  their  own  ways  to  adjust  to  our  sudden  ability 
to  produce  DNA  sequence  on  a  large  scale.  In  this  context  the 
Perkin-Elmer  announcement  is  a  bold  example  of  the  response  of 
the  biotech  sector  to  these  opportunities. 

Perkin-Elmer  is  adopting  here  an  overtly  biotech  style  of  oper- 
ation despite  its  roots  as  a  manufacturer  of  scientific  instruments 
and  reagents.  It's  a  hallmark  of  the  biotech  style  that  time  is  of  the 
essence and publicity is a key tool for influencing events. Those of
us who are watching this spectacle from the sidelines should certainly
wish  Perkin-Elmer  well.  The  company's  investment  will  surely  lead 
to  faster  testing  of  new  reagents  and  instrumentation  and  also  will 
produce  much  data  that  will  be  of  both  commercial  value  and  basic 
scientific  interest. 

However, the excitement generated by the well-orchestrated public
relations campaign surrounding this announcement should not
disguise that what we have at the moment is neither new technology
nor even new scientific activity. What we have is a press release.
And I believe I speak for many academic spectators
when I say I look forward to a transition from plans to reality. In
short, show me the data.

I  cannot  emphasize  too  strongly  that  science  by  press  release 
and,  worse  yet,  science  policy  by  press  release  is  not  a  path  that 
the United States Congress or the federal agencies want to walk
down. I believe that the overwhelming risk for the publicly-funded
program is one of overreaction. What the Perkin-Elmer initiative
offers with the greatest probability is that the immediate needs of the
biological community during a period of a few years, roughly in the
interval 2000 to 2003, may be better met than would otherwise
have  been  the  case.  And  I  hope  that  the  project  is  successful  and 
that  the  data  are  sufficiently  accessible  to  the  scientific  community 
that  this  promise  is  met. 

However, in the larger scheme of the Human Genome Project, we
would all be unwise to focus on so transient a contribution. The
case for the transience of these data's value lies in one's assessment,
in advance of any real basis to make such a judgment, of the
likely quality of the final product, as has been mentioned repeatedly by
others at the table, and this will be a subject of intensive technical
discussion for some years to come.

I,  frankly,  am  a  skeptic  that  the  approaches  as  publicly  described 
will  lead  to  a  product  of  sufficient  quality  to  meet  the  long-term 
needs  of  the  scientific  community.  I'm  prepared  to  be  proven  wrong, 
as  any  scientist  must  be,  but  I  am  comfortable  predicting  that  this 
approach, as the downside of its efficiency, will encounter reasonably
catastrophic problems at the stage at which the tens of millions
of independent sequencing tracts need to be melded together
to produce a composite view of the human genome.

To be specific, I'm comfortable predicting that there will be over
100,000 serious gaps in the final product, and in this context I
define a serious gap as one in which there is uncertainty even as to how
one should orient and align the islands of assembled sequence
between the gaps. Furthermore, I'll predict that a substantial
fraction, particularly of the smaller islands of sequence produced, will be
misassembled, that is, they will not actually correspond to the
organization of the human genome, and I say these things being
thoroughly familiar with, and admiring of, TIGR's success in sequencing
bacterial genomes by what superficially would appear to be a similar
strategy.

I  want  to  emphasize  that  even  such  data  will  certainly  have  con- 
siderable biological  utility  and  it  may  prove  to  be  a  major  help  in 
the  final  push  toward  a  high  quality  human  sequence,  although  I 
would  also  emphasize  that  this  prospect  is  somewhat  less  certain. 
Experience  has  tended  to  show  that  large  amounts  of  low-quality 
sequence  data  are  a  poor  substitute  for  smaller  amounts  of  high- 
quality  data  collected  for  the  specific  purpose  of  assembling  a  con- 
tiguous, accurate  sequence  which  I  believe  should  continue  to  be, 
with  a  minimum  of  distractions,  the  focus  of  the  publicly-funded  ef- 
fort. 

Clearly, as time develops, if data from this private initiative
prove to be of clear utility in achieving that publicly-financed goal,
other strategies should, and will, adapt. I want to emphasize that
there are two reasons to aim high in terms of the quality of the
final human sequence. And, frankly, I am much more concerned
about the force of these arguments than I am about the opportunity
costs, although I acknowledge there will be opportunity
costs, associated with relatively transient delays in the availability
of the final product.

The  two  reasons  have  to  do,  first,  with  deferred  costs  as  a  prac- 
tical reason.  A  human  sequence  that  has  many  deficiencies  will 
defer  for  decades  to  come,  throughout  the  biomedical  research  en- 
terprise, the  need  to  fix  small  problems  as  they  are  encountered  by 
individual  investigators.  The  other  argument  is  perhaps  even  more 
important  in  taking  the  broad  view  of  public  policy  in  this  matter. 
And that is that all of us, as we build the total package of activity
in the public sector, the private sector, throughout biomedical and
agricultural product research, need collectively to achieve an
extremely high standard in human genetics. We should start with
an extremely high scientific standard and not waver in our commitment
to that goal.

The  human  genome  sequence  is  part  of  that  commitment.  A  more 
important  part,  built  upon  it,  will  be  our  study  of  human  variation 
and  the  biological  consequences  of  that  variation. 

So, I have some additional comments in my written record, but
I hope, for the purposes of this hearing, that the Congressional
message to the federal agencies responsible in this area will be that
you are proud of your institution's role in initiating this project and
look forward, as I do, to the production of a sequence that is freely
accessible to all scientists, delivered on schedule, and of impeccable
quality.

Thank  you. 

[The prepared statement and attachments of Mr. Olson follow:]

Testimony of Maynard V. Olson before the House Committee on Science, Subcommittee
on Energy and Environment, scheduled for June 17, 1998

The Human Genome Project has come a long way since its fragile beginnings a decade
ago. In its early years, the proposal to develop a complete DNA sequence of the human
genetic material often seemed an idea ahead of its time: the project's feasibility could
reasonably be questioned, there was little support amongst rank-and-file biologists, and
the pharmaceutical and agricultural-products industries were disengaged. Now, residual
technical  arguments  involve  minor  squabbles  between  experts,  basic  and  applied 
biological  research  is  reorganizing  itself  around  the  assumption  that  complete  genome 
sequences  will  soon  be  available  for  all  intensively  studied  organisms,  and  the 
commercial  sector  has  emerged  as  a  major  player  in  large-scale  genome  analysis.  Indeed, 
we  not  only  now  have  a  vigorous  biotech  industry — in  which  the  United  States  is  the 
undisputed  world  leader — but  a  whole  tier  of  "genomics"  companies  created  to  meet  the 
insatiable  demand  for  specialized  data  about  genomes  that  has  arisen  throughout  the 
biotechnology,  pharmaceutical  and  agricultural-products  industries. 

It  is  worth  reflecting  briefly  on  the  reasons  for  this  success.  First,  there  are  the  scientific 
fundamentals.  We  have  only  known  for  a  few  decades  that  all  life  is  based  on  digital 
information — the  "base-four"  code  of  DNA  sequence  that  is  now  featured  even  on  movie 
marquees  (as  in  the  movie  title  "GATTACA,"  which  is  simply  a  short  bit  of  DNA 
sequence  expressed  with  the  four  standard  symbols  G,  A,  T,  and  C).  The  information 
present  in  a  human  sperm  or  egg  cell  is  encoded  in  3  billion  G's,  A's,  T's,  and  C's.  Thus, 
the  total  information  content  of  the  human  genome  is  only  750  Megabytes — about  the 
capacity  of  a  compact  disc — an  awe-inspiring  level  of  data  compression. 
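
The 750 Megabyte figure above follows from simple arithmetic, which can be checked with a minimal sketch such as the one below. The sketch assumes decimal megabytes (one million bytes) and an arbitrary 2-bit code for each base; the function names and the particular assignment of bit patterns to G, A, T, and C are illustrative assumptions, not anything specified in the testimony.

```python
import math

# Illustrative assumption: the order of symbols fixes an arbitrary 2-bit code
# (G=0, A=1, T=2, C=3). Any one-to-one assignment would work equally well.
BASES = "GATC"
CODE = {base: index for index, base in enumerate(BASES)}


def genome_megabytes(n_bases: int, alphabet_size: int = 4) -> float:
    """Storage for n_bases symbols, in decimal megabytes (10**6 bytes)."""
    bits_per_base = math.log2(alphabet_size)  # 2 bits for a four-letter code
    total_bits = n_bases * bits_per_base
    return total_bits / 8 / 1_000_000


def pack(sequence: str) -> int:
    """Pack a DNA string into an integer using 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of known length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    print(genome_megabytes(3_000_000_000))  # 3 billion bases at 2 bits each
    print(unpack(pack("GATTACA"), 7))       # round trip of a 7-base sequence
```

Three billion bases at two bits per base come to six billion bits, or 750 million bytes, which matches the figure quoted in the statement.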

Although  the  challenge  of  interpreting  the  human  sequence  will  remain  a  central 
preoccupation  of  science  for  centuries  to  come,  available  sequence  data  already  yield  rich 
dividends.  Most  profoundly,  computer-based  methods  of  sequence  comparison 
frequently  allow  detection  of  functionally  informative  similarities  between  genes 
discovered  in  different  organisms.  This  feature  of  DNA  sequences  allows  biologists 
studying  human  diseases  to  infer  important  lessons  about  the  molecular  basis  of  these 
pathological  processes  through  gene-to-gene  comparisons  with  the  richly  informative 
data  already  available  about  the  genes  of  "model"  organisms  such  as  yeast  and  fruit  flies. 

A former member of this institution, Rep. Claude Pepper, deserves great credit for having
recognized  that  biological  research  needed  to  be  led  aggressively  into  the  information 
age.  His  support  for  establishment  of  the  National  Center  for  Biotechnology  Information 
at  the  National  Library  of  Medicine  is  one  of  the  great  success  stories  of  proactive 
involvement by the Congress in the building of research infrastructure. The World Wide
Web  site  of  the  NCBI,  on  which  DNA-sequence  comparisons  are  the  central  activity,  has 
become  a  major  epicenter  of  biological  research. 

As  the  NCBI  story  illustrates,  the  present  success  of  genome  analysis  has  roots  in  policy 
as  well  as  science.  In  the  Human  Genome  Project,  Congress  was  actually  ahead  of  the 
majority of scientists in recognizing that it was time to move boldly to create an
information-based future for biomedical research. The establishment of the Human
Genome Project, which led in a few years to the creation of one of the NIH's most dynamic
and forward-looking Institutes, the National Human Genome Research Institute, was the
work of a relatively small group of committed scientists and federal officials, who
brought a strong case to Congress and received an equally strong response. This
response was all the more impressive given the draconian budgetary constraints that had
to be overcome to bring the Human Genome Project into existence.

The  Congress  now  faces  the  new  challenge  of  understanding  and  responding  to  a 
scientific  environment  in  the  Human  Genome  Project  that  has  all  the  roiling  chaos  that 
comes with scientific and policy success. My basic message in this turbulent
environment  is  quite  simple:  the  system  is  working. 

Biomedical  research  in  the  United  States  derives  its  formidable  strength  from  the  synergy 
between  three  dynamic  sectors:  academic  research,  the  biotechnology  industry,  and  the 
pharmaceutical industry. Academic research, with its reliance on federal funding and the
stewardship of a highly evolved resource-allocation system administered by the NIH and
other federal agencies, is clearly "the goose that laid the golden egg." The
pharmaceutical industry provides a powerful engine for translating new research into
safe-and-effective products. As the pace of biological research has accelerated following
the  development  of  recombinant-DNA  techniques  and  the  introduction  of  other  new 
research  tools,  a  whole  industry — the  increasingly  important  biotech  sector — has  arisen 
to  respond  rapidly  to  new  commercial  opportunities.  This  sector  is  characteristically 
quicker  on  its  feet  and  more  willing  to  take  large  business  risks  than  the  pharmaceutical 
industry.  Time  will  tell  whether  the  pharmaceutical  and  biotech  sectors  ultimately  merge 
or  retain  their  currently  distinct  identities. 

The  present  landscape  in  the  Human  Genome  Project  illustrates  well  the  operation  of  all 
three  sectors.  The  academic  sector  is  focused  on  the  creation  of  a  high-quality  reference 
sequence  of  the  human  genome,  presently  targeted  for  completion  in  2005.  This  still- 
ambitious  goal  is  defined  in  terms  of  rigorous  quality-control  standards  enforced  through 
a  vigorous  process  of  peer-reviewed  scientific  performance  and  peer-assessment  of  data 
quality.  The  academic  sector  is  also  responsible  for  the  critical  task  of  training  a  growing 
cohort of young scientists who can lead genome analysis into its open-ended future.
Similarly,  academic  research  is  the  incubator  in  which  new  technical  approaches  and  new 
applications  of  genome  analysis  to  biology  are  under  development. 

Increasingly,  the  pharmaceutical  industry  is  redirecting  long-term  drug-discovery 
programs  to  exploit  the  new  opportunities  provided  by  an  avalanche  of  sequence  data, 
data  that  are  leading  daily  to  the  discovery  of  new  genes,  new  proteins,  and  new 
functional dimensions to life processes. In addition to its primary reliance on
public-domain sequence data, the pharmaceutical industry is building in-house data-collection
capabilities and, even more dramatically, pursuing such data through a host of
contracts, partnerships, and other relationships with biotech and genomics companies.

It is against this background that the recently announced Perkin Elmer initiative to
accumulate  a  large  database  of  DNA  sequences  sampled  directly  from  the  human  genome 
should  be  viewed.  Although  traditionally  a  manufacturer  of  scientific  instruments  and 
research  reagents,  Perkin  Elmer  is  adopting,  in  this  venture,  an  overtly  "biotech"  style  of 
operation.  The  business  risks  are  considerable  since  it  remains  unclear  how  the  company 
will  recover  its  substantial  investment.  Furthermore,  as  is  a  hallmark  of  biotech  research, 
time  is  of  the  essence  and  publicity  is  a  key  tool  for  influencing  events.  Those  of  us  who 
are watching this spectacle from the sidelines (i.e., as neither participants nor
competitors)  should  wish  Perkin  Elmer  well.  The  company's  investment  will  surely 
stimulate  rapid  reduction-to-practice  of  new  reagents  and  instrumentation  and  will  also 
produce  much  data  that  will  be  both  of  commercial  value  and  basic  scientific  interest. 
However,  the  excitement  generated  by  the  well-orchestrated  public-relations  campaign 
surrounding  the  Perkin  Elmer  announcement  should  not  disguise  that  what  we  have  at  the 
moment  is  neither  new  technology  nor  even  new  scientific  activity:  what  we  have  is  a 
press release. I believe that I speak for many academic spectators when I say that I look
forward  to  a  transition  from  plans  to  reality.  In  short,  "Show  me  the  data." 

The  risk  here  for  the  publicly  funded  program  is  one  of  overreaction.  What  the  Perkin 
Elmer  initiative  offers  is  the  possibility  that  the  immediate  needs  of  the  biological 
community  during  a  period  of  2-3  years,  roughly  in  the  interval  2000-2003,  may  be 
better met than would otherwise have been the case. I hope that the project is successful
and  that  the  data  are  sufficiently  accessible  to  the  scientific  community  that  this  promise 
is  met.  However,  in  the  larger  scheme  of  the  Human  Genome  Project,  we  would  all  be 
unwise  to  focus  on  so  transient  a  contribution. 

The  case  for  the  transience  of  these  data's  value  lies  in  the  likelihood  that  they  will  be  of 
poor  quality.  While  I  am  prepared  to  be  proven  wrong,  as  any  scientist  must  be,  I  am 
equally  prepared  to  put  my  reputation  as  a  scientific  prognosticator  on  the  line  in 
predicting  that  the  Perkin  Elmer  initiative  will  fail  to  produce  a  sequence  of  the  human 
genome  that  will  meet  the  long-term  needs  of  the  scientific  community.  Specifically,  I 
predict  that  the  proposed  technical  strategy  for  sampling  human  DNA  sequences  will 
encounter  catastrophic  problems  at  the  stage  at  which  the  tens  of  millions  of  individual 
tracts  of  DNA  sequence  must  be  assembled  into  a  composite  view  of  the  human  genome. 
Based  on  extensive  experience  with  the  assembly  of  composite  human  DNA  sequences  in 
our genome center and other laboratories, I predict that there will be over 100,000
"serious" gaps in the assembled sequence: a "serious" gap, in this context, is one in
which  there  is  uncertainty  even  as  to  how  to  orient  and  align  the  islands  of  assembled 
sequence  between  the  gaps.  Furthermore,  I  predict  that  a  significant  fraction  of  the  small 
islands  between  serious  gaps  will  be  misassembled  (i.e.,  they  will  not  actually  correspond 
to  the  organization  of  the  human  genome). 

Even such fragmentary data will certainly have considerable biological utility.
Furthermore,  it  may  prove  to  be  a  substantial  help  in  the  final  push  toward  a  high-quality 
human  sequence,  although  this  prospect  is  less  certain.  Experience  has  tended  to  show 
that large amounts of low-quality sequence data are a poor substitute for smaller amounts
of high-quality data collected for the specific purpose of assembling a contiguous,
accurate sequence.

It  is  of  the  utmost  importance  that  a  vigorous  public  effort  be  maintained  that  is  directed 
toward  the  development  of  a  sequence  that  will  meet  the  test  of  time.  There  are  two 
compelling  rationales  for  aiming  high  in  terms  of  the  quality  of  this  sequence.  In 
practical  terms,  any  other  approach  will  defer  large  costs,  diffusing  them  across  the 
biomedical  research  enterprise  for  decades  to  come  as  individual  investigators  are  left  to 
complete  and  correct  the  reference  sequence  in  regions  of  the  genome  where  the  data  are 
inadequate  to  meet  their  particular  needs.  Perhaps  still  more  important  is  the  need  to  set  a 
high  standard  in  all  aspects  of  human  genetics,  starting  with  an  unwavering  commitment 
to  quality  in  the  Human  Genome  Project's  flagship  mission.  Although  I  have  confidence 
that  the  spectacular  advances  we  are  currently  witnessing  in  human  genetics  will  lead  to 
great public benefit, I do not share the view, expressed in some quarters, that the speed
of generating data must take precedence over all other considerations. An element of
caution in developing this first comprehensive view of the human genetic material is
advisable. High scientific standards tend to be infectious. I would like the legacy of my
involvement  in  the  Human  Genome  Project  to  be  a  product  that  will  not  only  facilitate  the 
research  of  future  scientists  but  will  also  inspire  them  to  set  a  similarly  high  scientific 
standard  as  they  interpret  the  sequence  and  study  its  variation  from  one  human  to  another 
and  the  effects  of  that  variation  on  human  biology. 

For its part in bringing about this future, I would advise Congress to wait and watch
rather  than  to  attempt  to  provide  detailed  guidance  to  the  involved  agencies.  At  root, 
many  of  the  issues  are  deeply  technical  and  Congress  is  the  wrong  forum  in  which  to 
debate  the  relative  merits  of  capillary-gel  electrophoresis  vs.  slab-gel  electrophoresis, 
whole-genome  "shotgun"  sampling  vs.  a  clone-by-clone  approach,  and  so  forth.  The 
agencies  need  a  more  general  sense  of  how  Congress  views  the  public  benefit  associated 
with  the  Human  Genome  Project.  I  hope  that  the  Congressional  message  will  be  that  you 
are  proud  of  your  institution's  role  in  initiating  this  project  and  look  forward,  as  I  do,  to 
the  production  of  a  sequence  that  is  freely  accessible  to  all  scientists,  delivered  on 
schedule,  and  of  impeccable  quality. 

I would like to close by identifying three areas of concern that I do think bear further
scrutiny by appropriate Congressional processes. First, I think there is a strong case for
increased  funding  for  the  National  Human  Genome  Research  Institute,  although  my 
argument  for  increased  funding  would  differ  from  that  of  many  of  my  colleagues.  I 
believe  that  the  current  NHGRI  budget  is  actually  adequate,  in  combination  with  funding 
through  other  channels,  to  produce  a  quality  human  sequence  by  2005.  Given  the  large 
technical  uncertainties,  I  think  the  National  Research  Council  Committee  on  the  Mapping 
and  Sequencing  of  the  Human  Genome,  on  which  I  had  the  honor  of  serving,  did  a  good 
job  of  projecting  the  cost  of  the  Human  Genome  Project.  Indeed,  it  also  did  a  good  job  of 
estimating  the  time  required  to  complete  the  project.  I  doubt  that  the  current  schedule 
could  be  much  accelerated  without  encountering  human-resource  bottlenecks  that  would 
be difficult to overcome.

However, I am concerned that without expanded funding, the peak phase of data
production for human sequencing will drain other valuable activities at the NHGRI. The
NRC Committee did not fully envision the rapidity with which genome analysis would
open  up  new  opportunities  in  biological  research.  Indeed,  the  Perkin  Elmer  proposal  is 
but  one  symptom  of  the  magnitude  and  immediacy  of  these  opportunities.  While  moving 
ahead  toward  its  flagship  goal  of  producing  a  quality  human  sequence,  the  NHGRI  also 
faces  increasing  responsibilities  to  identify  and  stimulate  research  avenues  opened  by  the 
early successes of the Human Genome Project. These opportunities include development
of  new  technology,  improved  computational  methods  for  analyzing  DNA  sequence, 
approaches  to  the  comprehensive  functional  analysis  of  genomes,  and — perhaps  most 
profoundly — characterization  of  natural  variation  in  human  DNA.  In  my  view,  the 
strongest  case  for  increased  NHGRI  funding  lies  in  its  excellent  track  record  and  the 
continuing  expansion  of  research  opportunities  in  areas  that  go  beyond  the  Institute's  core 
mission  but  which  provide  critical  links  between  the  emerging  human  sequence  and  the 
rest  of  biological  research. 

Two  other  issues,  which  are  illustrated  by,  but  not  narrowly  related  to,  the  Perkin  Elmer 
initiative  bear  Congressional  attention.  The  most  important  concerns  the  influence  of 
intellectual-property  law  on  the  research  enterprise.  Particularly  in  areas  where  the 
interests of the three major sectors of biomedical research (academe, the pharmaceutical
industry, and the biotechnology industry) diverge, there are increasing signs of trouble.
The  pharmaceutical  industry  has  legitimate  concerns  that  it  has  become  too  easy  for 
biotechnology  companies  to  acquire  valuable  intellectual-property  rights  through  cream- 
skimming  research  investments.  Continuation  of  the  current  system  risks  the 
accumulation  of  disincentives  for  drug  development  in  certain  areas  or,  alternately, 
diversion  of  the  attention  of  pharmaceutical  companies  into  purely  defensive  acquisition 
of its own tenuous intellectual-property claims. Academic research faces other concerns.
Foremost  amongst  these  are  situations  in  which  the  conduct  of  basic  research  in  the  non- 
profit sector — the  very  research  on  which  our  current  success  rests — is  distorted  by 
conflicts  over  intellectual  property  and  access  to  data.  In  the  worst  cases,  commercial 
owners  of  intellectual  property  are  using  their  property  to  attempt  to  impede  research  in 
the  non-profit  sector  when  they  do  not  see  that  research  as  compatible  with  their  short- 
term  interests. 

A  more  direct  warning  posed  by  the  Perkin  Elmer  initiative  is  that  academic  researchers 
risk  losing  equal  access  to  critical  research  tools.  These  tools,  such  as  advanced 
instrumentation  for  DNA  analysis,  are  increasingly  seen  as  a  means  through  which  their 
developers can acquire intellectual property rather than as products in their own right.
Perhaps  if  the  microscope  were  a  contemporary  invention,  we  would  find  optical 
companies  competing  to  sell  images  rather  than  microscopes.  Basic  scientists  need 
access to state-of-the-art research tools, not just to the output of these tools. However,
the  tools  themselves  are  now  universally  refined,  manufactured,  and  marketed  by  private 
companies  rather  than  by  basic  researchers  themselves.  Hence,  tool-making  companies 
are  in  a  powerful  position  to  influence  the  directions  that  basic  research  takes  and  the 
distribution  of  that  research  between  the  non-profit  and  for-profit  sectors. 
Instrumentation provides one simple illustration of this dynamic; however, even more
problematic situations arise in areas such as reagents, analytical processes, and reference
databases. There are no simple answers to the resultant dilemmas, but the public interest
in  keeping  basic  researchers  well  equipped  to  do  their  work  is  clear.  The  United  States  is 
the  world  leader  in  an  area  that  is  central  to  the  human  future — biomedical  and 
agricultural  research — and  it  has  gained  this  enviable  position  by  coupling  the  world's 
strongest  system  of  research  universities  to  an  aggressive  commercial  sector.  Effort 
expended fine-tuning the relationship between these parties will be effort well spent.

Chairman Calvert. Thank you, Doctor.

REASONS  FOR  FEDERAL  GOVERNMENT  TO  COMPLETE  HUMAN 

GENOME  SEQUENCING 

Chairman  CALVERT.  This  question  is  first  for  Dr.  Patrinos  and 
Dr.  Collins.  Doctors,  in  a  guest  column  in  The  New  York  Times,  Dr.
William  Haseltine,  a  former  Harvard  Medical  School  professor  and
CEO  of  his  own  genomics  company  said  the  following,  "It  makes  lit- 
tle sense  for  the  Federal  Government  to  go  to  the  trouble  of  decod- 
ing the  junk  DNA.  The  $3  billion  of  federal  money  now  devoted  to 
the  entire  human  genome  should  be  spent  instead  on  university- 
based  research  initiated  by  individual  medical  investigators.  The 
era  of  government-sponsored  big  science  in  which  a  few  labora- 
tories receive  as  much  as  $10  million  a  year  to  analyze  mostly  junk 
DNA,  while  scientists  doing  disease-related  research  beg  for  financ- 
ing should  end." 

At  this  point,  if  there  is  no  objection,  I  would  ask  unanimous  con- 
sent to  insert  the  entire  column  at  this  point  in  this  record  and, 
hearing  no  objection,  so  ordered. 

[The  information  referred  to  follows:] 

Chairman  CALVERT.  And  with  that,  I  assume  that  each  of  you 
disagree  and  could  you  tell  us  why? 

Dr.  Collins.  Every  new  development  in  science  or  in  public  pol- 
icy tends  to  bring  out  of  the  woodwork  individuals  with  fringe  opin- 
ions who  seek  to  take  advantage  of  that  new  development  to  pro- 
mote their  own  agenda.  In  this  instance,  the  comments  you  quote 
are  those  of  an  individual  who  has  a  transparent  financial  conflict 
of  interest  in  making  such  assertions,  given  that  the  future  of  his 
particular  business  enterprise  would  be  best  served  by  genome 
projects  of  all  sorts,  public  or  private,  ceasing  to  exist.  In  addition, 
there  are  statements  in  those  remarks  which  I  think  the  vast  ma- 
jority, I  would  say  greater  than  99  percent  of  the  scientific  commu- 
nity, would  profoundly  disagree  with.  What  Dr.  Haseltine  refers  to 
as  junk  DNA  includes  sequences  that  play  profound  roles  in  juve- 
nile onset  diabetes,  in  cancer,  in  osteoporosis,  and  many  other  dis- 
eases and  that  has  been  scientifically  demonstrated. 

So,  I  would  ask  you  not  to  consider  that  particular  point  of  view 
as  representative  of  the  mainstream  of  scientific  thought,  either 
public  or  private. 

Chairman  Calvert.  Thank  you  for  your  clear  answer.  Dr.
Patrinos.  [Laughter.] 

Dr.  Patrinos.  I  certainly  couldn't  have  said  it  better  myself. 

REFOCUSING  OF  FEDERAL  HUMAN  GENOME  PROJECT 

Chairman  Calvert.  Dr.  Galas,  in  his  testimony,  and  this,  again, 
is  for  Dr.  Patrinos  and  Dr.  Collins,  says  the  Federal  Government
approach  should  continue  but  should  refocus  its  goals  to  produce  a 
first  draft,  and  that  was  indicated  also  by  other  witnesses,  of  the 
human  genome  as  soon  as  possible.  Will  the  government  program 
consider  this  approach  in  collaboration  with  the  private  effort  by 
Dr.  Venter?  Would  you  like  to  respond  to  that  as  well.  Doctor?  But, 
go  ahead. 

Mr.  Patrinos.  As  I  mentioned  in  my  oral  remarks,  this  is,  in 
fact,  our  intention.  I  agree  wholeheartedly  with  what  Dr.  Galas  has 
said  about  the  value  of  providing  this  intermediate  product  as  soon
as  possible  and  we  certainly  plan  to  deliver  that  intermediate  prod- 
uct in  coordination  and  in  full  cooperation  with  private-sector  ini- 
tiatives such  as  the  initiative  that  Dr.  Venter  described. 

Dr.  Collins.  It  is,  actually,  worthy  of  note  that  there  is  a  plan- 
ning process  under  way  right  now  for  the  NIH  and  the  DOE  ge- 
nome programs.  Ari  and  I  work  together  on  all  of  these  planning 
processes  and  there  was  a  meeting  just  three  weeks  ago  involving 
more  than  a  hundred  scientists  from  various  fields,  most  of  them 
not  genome  scientists,  to  look  at  the  next  5  years  of  the  genome 
program.  This  subject  of  whether  or  not  the  publicly-funded  effort 
should  revise  its  strategy  in  light  of  the  new  developments  was  in- 
tensely discussed. 

I  think  it's  fair  to  say  there  is  not  complete  unanimity  on  the  an- 
swer to  that  question,  in  part,  because  of  the  uncertainty  until  that 
new  initiative  has  moved  forward  a  bit  about  exactly  what  it  will 
look  like.  But  I  can  certainly  reassure  you,  this  is  being  looked  at 
with  great  intensity  and  I'm  sure  Ari  would  agree  with  me  that  as 
that  data  begins  to  become  available  we  will  be  doing  everything
possible  to  adjust  the  strategy  to  make  the  most  of  that  and  to  get 
to  the  goal  as  quickly  as  possible. 

FEDERAL  PROGRAM'S  USE  OF  LATEST  TECHNOLOGIES 

Chairman  CALVERT.  Let's  briefly  discuss  new  technologies.  There 
was  discussion  about  that  today  also.  Is  the  federal  program  using 
the  latest  technologies,  for  example,  the  new  robotics  advances  in 
the  last  several  years  in  our  endeavor  on  our — answer  the  question. 
Doctor? 

Mr.  Patrinos.  There  are  indeed.  Certainly  among  both  our  lab- 
oratory and  academic  performers  in  the  human  genome  project 
there  are  many  examples  of  cutting-edge  technologies  in  robotics, 
sequencing  technologies  in  general.  This  is  a  field,  of  course,  that 
is  rapidly  changing.  Advances  are  expected,  as  I  mentioned  earlier, 
and  probably  will  be  the  norm  rather  than  the  exception,  the  sur- 
prising new  developments,  that  is. 

Dr.  Collins.  I  would  agree  with  that.  In  fact,  I  would  add  to  it 
the  federally-funded  effort  is  not  only  using  the  new  technology, 
we're  developing  a  lot  of  it.  The  NIH  component  of  the  Human  Ge- 
nome Project  spends  $20  million  a  year  on  technology  development. 
One  of  our  successes  is  the  DNA  chip  which  was  founded  on  the 
basis  of  a  company  that  got  going  with  an  NIH  grant  about  4  or 
5  years  ago.  So  we  are  intensely  interested  in  technology  develop- 
ment. Many  of  our  grantees  are  engineers,  they  are  not  necessarily 
all  biologists,  computer  scientists,  robotics  experts,  and  the  like. 
This  is  part  of  our  goal. 

Mr.  Patrinos.  Let  me  add  also  one  thing.  At  least  the  Department
of  Energy  is  investing  some  modest  amount  of  funding  in
some  of  the  cutting-edge  technologies  that  we  expect  will  be  in 
place  not  in  the  next  few  years,  but  maybe  10-20  years  from  now, 
ones  that  are  sort  of  blue  sky  right  now  with  respect  to  their  fea- 
sibility because  we  know  that  technology  changes  very,  very  quick- 
ly and  how  we  sequence  20  years  from  now  will  probably  be  en- 
tirely different  than  how  we  are  sequencing  today. 

Chairman  Calvert.  Thank  you.  Mr.  Roemer. 

FEDERAL  BUDGET  FOR  THE  HUMAN  GENOME  PROJECT

Mr.  Roemer.  Thank  you,  Mr.  Chairman.  I  first  of  all  want  to
thank  the  panel  once  again  for  your  very  helpful  testimony  on  a 
very  complicated  subject.  Certainly  in  my  background  in  political 
science  and  in  other  areas  that  prepared  me  maybe  better  for  run- 
ning for  Congress  than  it  did  contemplating  many  of  these  very 
complicated  questions  that  you  experts  deal  with,  we're  very  appre- 
ciative for  your,  not  only  your  expert  testimony  but,  I  think  the 
way  that  you've  also  presented  your  testimony  today  as  well  too, 
in  a  very  helpful,  very  persuasive,  and  very  collaborative  sense.  We 
haven't  had  complete  unanimity  from  the  panel  today  and  I  want 
to  get  to  that  point.  But  first  of  all,  Dr.  Venter,  I  want  to  ask,  to 
make  sure  that  I  heard  your  remark  and  clarify  on  it.  You  said 
that  in  this  collaborative  effort,  you  would  not  encourage  Congress 
to  cut  the  budget.  In  fact,  you  would  encourage  the  Congress  to  in- 
crease the  budget  for  this  particular  project,  even  though  we're  see- 
ing this  collaborative  public/private  partnership.  Is  that  correct? 

Mr.  Venter.  That's  absolutely  correct,  but  not  just  for  sequenc- 
ing of  humans.  It  is  because  we're  going  to  have  the  sequence  so 
much  faster  that  we  can  now  move  to  the  phase  that  all  of  us  hope 
to  in  the  envisioning  of  the  human  genome  project  in  the  first  place 
is  starting  to  interpret  and  understand  that  genetic  code.  It  will  not 
be  interpretable  without  having  mouse  and  other  genome  se- 
quences so  the  fact  that  human's  going  to  be  there  faster,  we  need 
mouse  even  faster.  Of  the  60,000-80,000  human  genes,  there's  only 
around  5,000  of  those  genes  that  have  full-length  cDNA  sequences 
available  to  the  worldwide  community.  Stepping  up  the  effort  so 
that  every  one  of  those  human  genes  has  a  full-length  cDNA  se- 
quence, which  can  be  done  on  a  very  broadly-distributed  effort  in 
America's  universities,  we'll  move  forward  to  make  sure  we  have 
the  tools  on  a  broadest  possible  sense  for  everybody  to  use.  There's 
more  reasons  to  fund  more  genomic  research  now  than  there  ever 
has  been. 

Mr.  Roemer.  So  your  testimony,  which  is  very,  you  know,  persuasive
and  compelling  testimony,  you  say  that  in  this  collaborative
effort,  you  are  not  replacing  something  that  is  being  done  in
the  publicly-funded  research.  In  fact,  in  this  collaborative  effort, 
you  are  working  together  in  a  partnership  and  that  does  not  mean 
that  slices  should  be  taken  out  of  the  existing  budget. 

Mr.  Venter.  Well,  we're  certainly  trying  our  best  to  work  together
and  I  don't  think  anything  should  be  taken  out  of  the  budget.
I've  heard  from  some  of  my  colleagues  here  that  they've  been
criticized  for  wasting  federal  dollars  based  on  this  new  announce- 
ment. I  think  that's  a  very  unfair  and  unfortunate  use  of  our  an- 
nouncement for  people  who  have  the  agenda  to  attack  the  pro- 
grams. I  think  it's  a  very  different  situation  3  years  from  now,  per- 
haps looking  back,  if  we  are  successful,  and  we  would  not  have 
made  this  announcement  if  we  didn't  intend  to  be,  but  I  think  we, 
we  want  to  be  judged  on  our  accomplishments,  not  by  our  press  re- 
leases or  announcements,  and  our  accomplishments,  hopefully,  will 
show  that  it's  wise  to  change  the  directions  currently  under  way  to 
work  with  us  in  a  collaborative  fashion  to  move  this  important  re- 
search forward  faster  for  everybody. 

DR.  OLSON'S  CRITICISMS  OF  PRIVATE-SECTOR  VENTURE

Mr.  ROEMER.  Thank  you.  Dr.  Collins,  you  said  in  your  testimony, 
I  believe  you  said  in  your  testimony,  that  you  had  worked  at  the 
University  of  Michigan  and  you  had  worked  on  the  cystic  fibrosis 
and  Huntington  disease  structuring  or  the  DNA  researching  and
that  that  had  taken  close  to  a  decade.  You  got  some  pretty  strong 
criticism  from  Dr.  Olson,  even  though  you  have  some  practical  experience
in  academic  life,  he  used  pretty  strong  words  such  as  this
is  science  by  press  release,  this  is  public  policy  by  press  release.  He 
predicted  there  are  going  to  be  100,000  gaps  in  the  final  product 
and  misassembled  data  and  so  forth.  How  do  you,  as  somebody  that 
has  been  in  his  shoes  in  academic  life  at  the  University  of  Michi- 
gan, respond  to  this  rather  strong  criticism  and,  well,  let  me  leave 
it  at  that.  And,  I  would  just  say  that  you  certainly  were  not  shy 
when  it  came  to  your  remarks  about  Dr.  William  Haseltine's  re- 
marks as  well  too. 

Dr.  Collins.  Mr.  Roemer,  I  think  there's  a  little  confusion  in  the 
nature  of  Dr.  Olson's  remarks.  Again,  I'm  the  person  who  is  respon- 
sible for  overseeing  the  federally-funded  effort  at  the  National  In- 
stitutes of  Health  on  the  genome  project.  I  believe  his  comments 
about  difficulties  in  assembling  the  structure  were  related  to  the 
announcement  by  Perkin-Elmer  and  Dr.  Venter  and  not  directed  at 
the  publicly-funded  effort. 

As  a  researcher  who  worked  on  cystic  fibrosis  and  was  fortunate 
to  lead  one  of  the  two  teams  that  worked  together  to  find  that 
gene,  I  can  tell  you  that  the  10  years  that  went  by  during  that  en- 
terprise where  I,  as  a  physician,  had  to  keep  explaining  to  families 
whose  children  were  increasingly  getting  sicker  that  we  hadn't 
found  the  gene  yet  because  it  was  just  too  hard,  were  among  the 
more  frustrating  years  of  my  life  and  I  don't  wish  that  on  anybody 
in  the  future.  And  that  is  one  of  the  major  motivators  to  do  this 
project  and  to  do  it  right.  Actually,  Dr.  Olson  and  I  are  pretty  much 
in  sync  on  this.  I  do  believe  that  until  the  Perkin-Elmer  effort  has 
produced,  over  the  course  of  the  next  2  or  3  years,  the  data  that 
will  be  required  to  evaluate  this  strategy,  that  exactly  what  kind 
of  a  product  comes  out  of  it  is  not  knowable.  It's  not  that  we're  just 
not  doing  our  homework  to  know  it,  it's  not  knowable.  It  is  a  prob- 
lem that  hasn't  been  tried  before  and,  therefore,  I  agree  with  Dr. 
Olson  that  the  publicly-funded  effort,  which  Dr.  Patrinos  and  I  are 
responsible  for,  should  not  drastically  alter  our  strategy  which  is 
targeted  toward  having  this  final  complete,  highly-accurate  product 
until  we  have  some  more  data. 

Mr.  ROEMER.  But  I'm  asking  you  objectively  as  a  scientist  to  com- 
ment on  Dr.  Olson's  remarks  about  Dr.  Venter,  that's  what  the 
question  was  about,  not  a  confusion  as  to  where  the  criticism  was 
coming  from — or  where  it  was  directed. 

Dr.  Collins.  I  think  as,  as  I  tried  to  say,  that  this  approach  to 
putting  together  the  human  genome  sequence  is  bold.  It  is  of  uncer- 
tain success  value.  It  could  be  that  2  or  3  years  from  now,  as  Dr. 
Olson  is  predicting,  we  end  up  with  a  rough  draft  which  is  actually 
rough  enough  that  it  is  very  difficult  to  work  with.  The  publicly-funded
effort  is  probably  the  only  part  of  this  enterprise  that's  absolutely
dedicated  to  obtaining  the  completely  contiguous,  highly-accurate,
close-all-the-gaps  enterprise  and  I  think  we  need  to  take
that  responsibility  and  take  it  seriously  and  will  continue  to  do  so. 
But  I  welcome  this  new  initiative  and  look  forward  to  seeing  what's 
going  to  happen.  It's  a  scientific  experiment;  we  like  that.  Sci- 
entists are  energized  by  the  opportunity  to  see  a  new  approach 
tried  out.  It  will  take  a  while  to  find  out,  but  that's  what  science 
is  all  about. 

Mr.  ROEMER.  So,  you  are  consistent  in  your  initial  enthusiasm 
with  your  testimony  for  Dr.  Venter's  efforts;  however,  you  do  have 
concerns  as  a  scientist  as  to  what  it  may  produce.  You  may  not 
agree  with  some  of  Dr.  Olson's  conclusions,  but  you  are  saying  that 
first  of  all,  this  effort  should  go  forward;  secondly,  you  are  excited 
about  the  potential;  thirdly,  you  do  have  questions  as  Dr.  Olson 
does  about  what  the  outcome  may  be? 

Dr.  Collins.  I  think  every  scientist  has  to  agree  with  Dr.  Olson 
when  he  says  show  me  the  data;  then  I  will  make  up  my  mind. 

ETHICAL,  LEGAL  AND  SOCIAL  CONCERNS

Mr.  ROEMER.  Dr.  Patrinos,  you  said  in  your  initial  testimony  as 
well,  that  you're  excited,  you  support  this  collaborative  effort.  You 
also  said  that  you  have  some  ethical  and  legal  and  social  concerns. 
Can  you  be  a  little  bit  more  specific  as  to  what  those  might  be  and 
do  they  come  back  to  some  of  Dr.  Olson's  concerns  about  access, 
privacy,  or  any  of  those  other  issues? 

Mr.  Patrinos.  Of  course,  as  you  know  the  Human  Genome
Project  from  the  very  beginning  identified  the  ethical,  legal,  and  so- 
cial implications  of  this  project  as  very  important,  in  fact  the  HGP 
carved  a  significant  piece  of  the  budget  from  the  very  beginning  to 
deal  with  those  issues  and  that's  something  we've  been  doing.  Dr. 
Collins  and  I,  for  quite  some  time.  My  comment  was  mostly  made 
in  the  context  of  the  faster  delivery  of  the  product.  In  a  sense  the 
faster  delivery  of  the  product  will  confront  us  with  many  of  the  eth- 
ical and  legal  and  social  implications  of  the  project  that  have  been 
articulated  by  many  of  the  scientists  and  the  science  managers  in- 
volved  

Mr.  Roemer.  Please  give  me  some  examples  of  what  that  con- 
flict  

Mr.  Patrinos.  Issues  of  privacy  and  confidentiality  of  genetic  in- 
formation, issues  of  insurance  and  employment  discrimination,  the 
multitude  of  issues  in  forensics.  You  know  the  list  is  endless,  we 
can  have  an  entire  hearing  solely  devoted  to  this  as  I'm  sure  Dr. 
Collins  would  be  delighted  to  have  such  a  hearing  because  this  is 
one  of  his  very  important  private  concerns.  So  I  was  making  ref- 
erence to  the  issue  of  having  that  information  faster  than  perhaps 
we  had  expected  a  few  years  back  and,  thus,  forcing  us  to  confront 
some  of  these  issues  sooner  rather  than  later. 

Mr.  Roemer.  I  would  hope  that  our  Chairman  might  be  ame- 
nable to  having  another  hearing  on  that  and  learning  of  some  of 
those  potential  problems  and  gleaning  maybe  some  of  the  potential 
answers  to  those  problems  and  maybe  having  an  ethicist  as  well
to  discuss  what  those  might  be.  With  that,  I  understand  my  col- 
league from  Michigan  has  to  leave  the  hearing  and  I'd  be  happy  to 
yield  back  the  time,  although  I'm  sure  the  Chairman  has  been  very 
patient  with  me  and  I  don't  have  any  time  left,  so. 

Chairman  Calvert.  Well,  we  certainly  can  come  back  for  another
round,  so,  that's  not  a  problem.  Mr.  Ehlers.

Mr.  Ehlers.  Thank  you,  Mr.  Chairman.  It  is  a  very  interesting 
hearing.  I  apologize  for  being  a  little  bit  late,  but  it's  been  one  of 
those  days  again.  It  all  sounds  terribly  complicated  to  me,  then 
maybe  because  I'm  a  physicist  I  am  used  to  dealing  with  simple 
problems,  just  electrons  and  nuclei  and  quarks,  and  so  forth.

PATENTABILITY  OF  HUMAN  GENOME 

Mr.  Ehlers.  Dr.  Venter,  I  think  I  understand  the  difference  be- 
tween your  approach  and  what  we  may  call  the  standard  approach 
but  I'm  interested  in  your  comment  that  you,  in  your  written  testi- 
mony you  say  that  this  will,  your  actions  will  basically  make  the 
human  genome  unpatentable.  Can  you  explain  that  to  me?  Are  you 
saying  that  you  are  going  to  wipe  out  so  much  of  it  that,  and  you're 
not  planning  to  patent  it,  that  no  one  will  be  able  to,  or  what?  Just 
what  do  you  mean  by  that? 

Dr.  Venter.  Well,  our  plan,  as  we've  announced  in  our  so-called 
press  barrage  was  that  we  do  plan  to  make  the  sequence  data  we 
generate  over  the  next  couple  of  years  on  the  complete  human  ge- 
nome accessible  to  the  public.  We  do  not  plan  to  patent  that  human 
genome  sequence,  the  human  chromosomes,  or  the  complete  ge- 
nome. In  fact,  by  putting  it  in  the  public  domain  as  the  individuals 
who  sequence  that  information,  if  we  do  not  patent  it,  we  will  be 
making  it  and  rendering  it  unpatentable  by  others.  However,  we 
will  be  using  that  sequence  as  the  beginning  for  discoveries,  as  all 
others  will  be  able  to,  once  we  release  it  to  discover  new  genes  that 
are  key  for  pharmaceutical  development,  new  hormones  that  could 
become  pharmaceutics  themselves  and  the  key  to  understanding 
key  human  diseases. 

Some  of  those  genes,  such  as  the  gene  for  human  insulin  when 
Genentech  patented  it,  that  allowed  the  process  to  begin  for  human 
insulin  to  be  available  to  diabetics  as  a  drug  because  someone  was 
willing  to  produce  it.  We  will  be  patenting  cDNA's  in  a  limited 
number  for  new,  exciting  discoveries  that  we  make  with  the  ge- 
nome. The  human  chromosome  sequence  itself  and  the  human  ge- 
nome will  be  unpatented  by  us  and  because  we  will  be  doing  this 
so  quickly,  we  are  going  to  render  it  unpatentable  by  others. 

DIFFERENCE  BETWEEN  FEDERAL  HUMAN  GENOME  PROJECT  AND

PRIVATE-SECTOR  VENTURE 

Mr.  Ehlers.  Let  me  ask  another  question.  I've  done  some  experiments
which  demand  extreme  precision,  parts  in  10  to  the  9th,
and  very,  very  careful  work  over  some  time.  I've  also  done  some 
which  are  called  quick  and  dirty  where  you  are  just  trying  to  out- 
line the  parameters  of  something  to  decide  whether  or  not  there  is 
something  worth  investigating  there.  Is  that,  in  a  sense,  the  dif- 
ference between  the  so-called  human  genome  project  and  your 
work? 

Dr.  Venter.  Absolutely  not.  In  fact,  I  appreciate  you  asking  that 
question.  Quick  does  not  mean  dirty.  Quick  means  better  tech- 
nology, better  approaches,  new  strategies.  We're  going  to  be  se- 
quencing the  human  genome  10  times.  The  sequences  that  we've 
done  in  the  past  are  some  of  the  most  accurate  sequences  ever  put 
in  the  public  domain  by  any  scientist  and  we're  going  to  have  the
same  standard  for  the  sequences  that  we  do  with  the  human  ge- 
nome. It's  a  completely  different  strategy;  in  fact,  we  think  it's  a 
scientifically  more  justifiable  strategy  than  relying  on  clones  that
have  been  processed  several  times,  coming  from  limited  parts  of  the 
genome,  not  necessarily  reflecting  the  entire  genome.  We're  starting
with  the  entire  set  of  human  DNA,  the  entire  set  of  chromosomes
and  using  that  and  going  right  into  the  sequencing  machines
to  generate  the  data.  We're  relying  on  new  algorithms  we've
developed,  new  strategies  we've  developed,  and  the  very  forefront 
of  computing  to  be  able  to  reassemble  all  these  pieces  into  the  ge- 
nome. 

Mr.  Ehlers.  So  your  statement  would  be  that  your  method  is 
going  to  yield  results  with  the  same  completeness  and  the  same  ac- 
curacy as  the  Human  Genome  Project? 

Mr.  Venter.  We  actually  feel  that  our  approach  is  going  to  yield
more  completeness  and  at  least  the  same  level  of  accuracy  as  done 
by  the  best  groups,  including  our  own  that  have  now  been  sequenc- 
ing the  human  genome  by  the  existing  strategy.  It  is  unknown,  you 
know,  my  colleagues  are  correct  in  characterizing  this  as  an  experi- 
ment. But  some  of  these  same  individuals  are  the  same  ones  that 
criticized  our  approach  to  sequence  the  Haemophilus  influenzae  genome.
In  fact,  one  of  the  questions  I  get  asked  most  often  is  why
didn't  we  just  apply  to  the  Federal  Government  for  funds  to  do  this
new  strategy. 

Well,  I  think  it's  clear,  Maynard  Olson  is  the  Chairman  of  that 
review  committee  and  I  think  you've  heard  the  comments.  I  think 
if  we  went  and  asked  for  $300  million  to  do  this  new  project,  that 
they  might  get  some  good  chuckles  out  of  it,  but  it's  not  the  way 
new  initiatives  can  be  made. 

Mr.  Ehlers.  So,  basically,  what  I  hear  you  saying  is  it's  not  the 
contrast  between  the  precise,  complete  experiment  and  the  quick-and-dirty
experiment  but  rather  the  contrast  between  a  bureaucratic
risk-free  approach  and  a  more  thoughtful  modern  approach.

Mr.  Venter.  I  think  that  would  characterize  my  view  quite  well. 

RECAPTURING  PRIVATE  INVESTMENT 

Mr.  Ehlers.  All  right.  Next  question.  You  mentioned  $300  mil- 
lion. If  you  are  putting  $300  million  in,  obviously,  you  hope  to  get 
a  return  on  that,  or  at  least  your  investors  do.  How  will  you  recap- 
ture your  investment? 

Mr.  Venter.  Well,  the  goal,  in  fact,  the  strategy  that  we're  tak- 
ing proves  our  philosophy  that  getting  the  sequence  is  only  the  first 
step.  And  while  we  feel  morally  compelled  to  release  that  genome 
sequence  to  the  entire  public,  and  the  companies  that  have  pro- 
ceeded on  the  basis  of  secrecy  are  taking  things  very  much  in  the 
wrong  direction,  the  business  strategy  is  going  to  be  building  the 
ultimate  genome  database  relating  every  bit  that  we  can  of  the
human  genome  information  out  to  individuals,  to  physicians,  to 
biotech  companies  and  pharmaceutical  companies.  On  the  other 
side,  and  one  of  the  things  that  comes  out  of  this  whole  genome 
strategy  that  hasn't  been  discussed,  is  we  get  the  sequence  from 
both  chromosomes,  both  alleles,  and  we're,  in  the  first  three 
months  of  operation,  going  to  have  over  3  million  polymorphic  variations
that  we're  going  to  use  as  the  basis  for  setting  up  high
throughput  screening  of  patients,  of  individuals,  in  part  for  the 
pharmaceutical  industry  as  a  basis  for  the  new  clinical  trials  strati- 
fying patients.  This  is  going  to  be  the  basis  of  the  future  of  individ- 
ualized medicine  and  we  feel  we  can  build  a  very  major  business 
without  relying  on  secrecy  and  allowing  other  people  to  use  the
same  sequence,  discoveries,  for  their  businesses  and  for  their  own 
scientific  discoveries. 

Mr.  Ehlers.  Thank  you.  I  find  this  very  interesting  and,  as  Dr. 
Collins  observed,  this  is  an  experiment.  I  will  be  very  interested  in 
seeing  the  results  of  the  experiment  and  it  will  be  fun  to  get  you 
back  in  about  3  or  4  years  and  read  your  prepared  testimony  and 
your  answers  back  to  you  at  that  point. 

Mr.  Venter.  Thank  you.  I  appreciate  that. 

Mr.  Ehlers.  And  find  out  who  really  was  out  on  this  one.  Thank 
you  very  much. 

Mr.  Venter.  Thank  you. 

Chairman  Calvert.  Mr.  Ehlers.  Mr.  Bartlett. 

Mr.  Bartlett.  Thank  you  very  much,  and  I  apologize  for  not 
being  able  to  be  here  for  the  testimony. 

TENSION  BETWEEN  FREE  MARKET  AND  WIDE  INFORMATION 

DISSEMINATION 

Mr.  Bartlett.  We  obviously,  as  a  society,  have  two  objectives 
that  are  in  tension  here.  One  is  the  objective  to  make  knowledge 
of  the  genome  widely  available  so  it  will  benefit  the  maximum 
number  of  people.  The  other  is  to  use  competition  which,  wherever 
it's  used  in  our  free  market  society  makes  the  product  or  the  serv- 
ice better  and  it  makes  it  cheaper.  And,  obviously  these  two  things 
tend  to  be  in  tension  here.  How  do  we  proceed  so  that  we  maximize 
the  contributions  that  competition  will  make  and,  yet,  be  assured 
that  we  are  going  to  have  as  wide  a  possible  dissemination  of  this 
information  so  that  there  will  be  the  maximum  benefit  from  it? 

Mr.  Venter.  I  assume  that  question  is  for  me? 

Mr.  Bartlett.  Well,  for  whoever. 

Mr.  Venter.  Okay.  Well,  we're  going  to  be  disseminating  our  in- 
formation, first  in  terms  of  the  raw  sequence  itself  will  be  provided 
to  the  world  for  free  and  also  the  world  will  have  access  to  this  new 
database  that  we're  building.  We're  not  here  to  try  to  persuade 
NIH  or  DOE  or  anybody  else  not  to  do  what  they  are  doing.  We're 
not  concerned  with  competition.  I  would  hate  to  see  the  federal 
budget  cut  because  of  the  basis  of  what  we're  doing.  I  think  we  can 
proceed  much  better  if  we  work  together.  There's  clear  complementary
approaches  taken  with  both  strategies  that  will  yield  a  much
more  complete,  faster  product,  even  sooner  than  we  could  possibly 
anticipate.  We  would  like  to  be  judged,  as  I  said  earlier,  on  what 
we  accomplish.  We're  not  concerned  with  competition,  other  than 
my  concern  is  as  a  scientist  who  first  spent  10  years  at  the  NIH 
and  before  that  10  years  trying  to  get  NIH  grants,  my  institution
is  totally  funded  by  NIH,  DOE,  NSF,  and  Department  of  Defense 
grants.  I  have  as  much  concern  for  the  public  funding  of  science  as 
I  do  for  the  private  funding  of  science  and  if  it  goes  in  the  wrong 
direction,  we  all  lose  from  that  proposition. 

Dr.  Collins.  Could  I  add  a  comment?  I  think  you  asked  a  very
appropriate  question  about  how  to  balance  these  two  forces,  but  I 
think  this  is  a  very  good  example  where  those  two  forces  actually 
are  synergistic  on  both  counts.  Having  a  public/private  partnership
of  this  sort  should  speed  up  getting  the  final  product,  that's  the  na- 
ture of  a  synergism,  a  collaboration,  if  it  works,  and  we  are  deter- 
mined to  see  that  it  does  work.  But  I  believe  having  the  public  ef- 
fort continue  to  be  vigorously  involved  in  this  as  much  or  more  so 
than  they  have  been,  is  also  the  best  insurance  that  the  data  is 
made  publicly  accessible.  I  do  not  question  for  a  moment  Dr. 
Venter's  sincerity  in  his  statement  that  this  data  will  be  made 
available  on  a  quarterly  basis  in  a  database  that  anybody  can  look 
at.  I  know  that  that  is  what  he  is  committed  to  doing.  But,  after 
all,  the  sequence  of  the  human  genome  is  of  such  profound  impor- 
tance, that  I  think  a  scenario  where  large  quantities  of  it  were  only 
available  within  the  database  of  a  single  private  entity  might  be  a 
rather  unstable  situation.  If  business  demands  were  to  change  or 
personnel  were  to  change  or  the  stockholders  were  to  decide  it's  not 
such  a  good  thing  to  be  giving  this  all  away  anymore,  one  would
not  want  to  see  a  circumstance  where  the  publicly-funded  effort 
was  suddenly  found  to  have  dropped  the  ball.  We  don't  intend  to 
drop  the  ball. 

Mr.  Bartlett.  Thank  you.  I  am  very  supportive  of  private-sector 
funds  in  this  kind  of  scientific  endeavor.  Our  federally-funded  sci- 
entific organizations  have  done  an  exemplary  job  through  the  year, 
through  the  years,  but  in  spite  of  that,  I  have  a  growing  concern 
that  when  you  have  put  all  of  your  eggs  in  this  basket  which  is 
controlled  by  a  Congress  which  can,  which  can  change  course  very 
quickly,  that  we  put  the  future  of  science  at  risk.  And  so  I  am  very 
supportive  of  any  mechanism  which  attracts  more  private-sector 
funds  and  more  competition.  I  think  that  whenever  you  have  all  of 
the  direction  of  a  program  under  the  control  of  a  single  entity,  in 
this  case,  ultimately  the  Congress,  I  think  that  you,  that  you  buy 
some  risk  that  you  don't  need  to  buy,  if  the  ventures  are  broadly 
supported  through  competitive  infusion  of  private-sector  funds.  So, 
thank  you  very  much  for  your  answers. 

Chairman  Calvert.  Thank  you,  Mr.  Bartlett.  When  you  say 
things  change  rapidly,  everything  except  this  Congress.  Mr.  Roe- 
mer,  do  you  have  any  concluding  questions? 

CONCERNS  ABOUT  PUBLIC  ACCESS  TO  INFORMATION 

Mr.  Roemer.  Yes,  Mr.  Chairman,  just  one  or  two,  and  I  appreciate
getting  into  a  second  round  here.  I'm  reading  from  a  Washington
Post  article,  Tuesday,  May  12,  1998,  and  in  it,  I  quote  Dr.
Olson  saying,  "Even  though  they  are  promising  public  access,"  and
I  guess  you  mean  Dr.  Venter's  group? 

Mr.  Olson.  I  haven't  read  that  article 

Mr.  Roemer.  "They  control  the  terms  and  there  is  a  history  of
terms  being  more  onerous  than  is  acceptable  to  most  scientists."  Is 
that  your  quote? 

Mr.  Olson.  I  haven't  seen  the  article  in  question,  but 

Mr.  Roemer.  Does  that  sound  like  your  quote?

Mr.  Olson.  Sounds  like  me.  [Laughter.] 

Mr.  Roemer.  Can  you  clarify  what  you  meant  by  that  quote  and
maybe  we  can  get  Dr.  Venter  to  respond  to  that? 

Mr.  Olson.  Well,  as  I  say,  it  would  help  if  I  had  a  little  more 
context,  but,  at  the  close  of  my  written 

Mr.  Roemer.  Let  me  try  to  help  you  there,  Dr.  Olson,  because
I'm  not  sure  if,  you  know,  in  a  newspaper  article,  they're  limited 
by  space  and  I'm  not  sure  how  they  can  provide  in  terms  of  the 
lead-in.  The  previous  paragraph  says,  "These  companies  have  been 
granted  scores  of  patents  on  their  genetic  discoveries  raising  fears 
among  some  critics  that  a  handful  of  companies  will  control  the 
commercialization  of  a  vast  and  potentially  lucrative  biological  re- 
source. Those  fears  arose  again  yesterday  when  Venter  announced 
his  new  project,"  then  your  quote. 

Mr.  Olson.  I  see,  well,  at  the  close  of  my  written  testimony,  I 
actually  encouraged  the  Congress  to  keep  careful  track  of  the  im- 
pact of  intellectual  property  issues,  particularly  on  basic  research 
which  is  my  interest.  And  I  do  encourage  you  to  do  so.  I  share  Con- 
gressman Bartlett's  view  that  this  dynamic  involvement  of  multiple 
sectors  is  critical  to  the  health  of  contemporary  science. 

My  own  interest  happens  to  be  in,  my  most  vital  interest  hap- 
pens to  be  in  the  public  sector,  and  I  think  what  I  was  referring 
to  there,  in  the  short  history  of  proprietary  databases,  and  these 
databases,  which  are  privately  funded  are  at  their  inception  propri- 
etary and  should  be  proprietary,  they're  paid  for  by  private  funds, 
that  there  is  a  history  of  the  data  being  made  available  to  academic 
investigators  only  in  return  for  what  are  sometimes  called  reach- 
through  agreements  in  which  subsequent  discoveries  made  by  aca- 
demic investigators  using  those  data  will  be,  the  intellectual  prop- 
erty status  of  these  subsequent  discoveries  will  be  influenced  by 
the  agreement  that  must  be  signed  at  the  time  that  the  data  are 
made  available.  And  I  think  I  was  simply  trying  to  make  the  point 
in  this  context  that  there  are  different  degrees  of  accessibility  and 
I  think  most  scientists  are  comfortable,  particularly  with  genome 
sequence  data,  that  it  be  absolutely  unimpeded  by  hidden  costs. 

Mr.  Roemer.  So  your  reference  of  onerous,  terms  more  onerous
than  is  acceptable  to  most  scientists,  would  refer  to  these  reach- 
back  provisions 

Mr.  Olson.  Yes. 

Mr.  Roemer.  That  are  sometimes  used.  Dr.  Venter,  I  want  to
give  you  time  to  respond  to  that.  You  say  in  the  next  paragraph 
that  with  the  exception  of  perhaps  100  to  300  genetic  sequences 
that  you  expect  will  show  special  commercial  promise,  the  company 
will  make  all  the  genetic  information  available  free  to  the  world's 
scientists.  You  say,  I  quote,  excuse  me,  you  said,  and  I  quote,  it 
would  be  morally  wrong  to  hold  the  data  hostage  and  keep  it  se- 
cret, unquote.  Is  it  morally  wrong  to  keep  the  100  to  300  genetic 
sequences  from  this  same  kind  of  scrutiny  or  providing  this  to  the 
scientific  community? 

Mr.  Venter.  Well,  as  Dr.  Olson  knows  from  his  own  work  on  the 
Pseudomonas  aeruginosa  genome  with  private  companies,  there  is  a
big  difference  between  secrecy  and  accessibility.  One  hundred  percent
of  the  sequence  that  we  will  generate  will  be  publicly  available.
We  will  be  putting  it  in  the  public  domain.  Having  intellectual
property  rights  on  specific  genes  has  no  impact  on  Dr.  Olson
or  anybody  else.  They  allow  whatever  company  has  those  rights  the
ability  to  commercially  produce  that  product,  whether  it  be  insulin
or  erythropoietin,  whether  key  drugs  that  have  a  tremendous  impact
on  human  health.

I  agree  with  Dr.  Olson's  concerns  about  reach-through  rights  and 
we've  made  that  a  key  tenet  of  our  philosophy.  In  fact,  putting  the 
human  sequence  in  the  public  domain  guarantees  that  there  are  no 
rights,  reach-through  or  otherwise,  that  come  with  this.  Any  licens- 
ing that  we  do  will  not  have  reach-through  rights.  We're  basing 
this  company  and  the  commercial  aspects  on  this  on  building  the 
best  database  ever.  If  it's  not,  nobody  will  pay  to  have  access  to  it 
because  they  won't  want  it.  If  we  can't  measure  polymorphisms
faster  and  better  and  more  meaningfully  than  anybody  else,  we 
won't  make  money.  If  the  genes  we  discover  don't  have  an  impact 
on  medicine,  nobody  will  want  to  license  those.  None  of  those  have 
any  impact  whatsoever  on  whether  the  fundamental  data  is  widely 
and  freely  available  to  others. 

CONSEQUENCES  OF  INTELLECTUAL  PROPERTY/PATIENT/PRIVACY

RIGHTS 

Mr.  ROEMER.  Finally,  Dr.  Collins,  let  me  just  end  with  this  final 
question  and  I'm  not  sure  that  I  will  phrase  it  the  way  that  I  want 
so  bear  with  me.  Is  there,  then,  a  difference  here  that  we're  speak- 
ing about  in  this  collaborative  effort  that  if  Dr.  Venter's  group  se- 
quences the  DNA,  does  the  DNA  sequencing  for  some  form  of  can- 
cer, or  Parkinson's,  or  Alzheimer's  and  has  a  patent  or  privacy  on 
that,  is  there  different  access,  then,  for  that  particular  scientific 
knowledge  than  there  would  be  under  the  research  that  the  NIH 
and  DOE  are  doing?  And  what  are  the  consequences  of  that? 

Mr.  Collins.  These  are  subtle  and  difficult  questions,  but  let  me 
do  the  best  I  can.  The  way  that  the  publicly-funded  effort  is  going 
forward  is  that  we  insist  that  our  grantees,  who  are  working  at 
universities  all  over  the  country  and  also  at  the  DOE  labs  (and  this 
also  applies  on  the  international  scene  to  the  large-scale  genome 
sequencing  efforts  that  are  going  on  in  other  countries)  deposit 
their  sequence  data  within  24  hours  of  the  time  it  reaches  an  as- 
sembly of  2,000  letters  in  a  row  or  more. 

We  are  not,  at  the  NIH,  allowed  to  deny  our  grantees  the  oppor- 
tunity to  file  for  intellectual  property  rights  on  things  they  discover 
with  NIH  funds,  because  of  the  Bayh-Dole  Act.  So,  we  cannot  tell 
them  not  to  do  that,  but  by  insisting  upon  this  early  deposit  of  the 
data,  the  net  outcome  of  that  seems  to  be  that  that  filing  is  not 
going  on. 

To  our  knowledge,  none  of  the  genome  centers  are  filing  for  intellectual
property  protection.  They  just  don't  have  time  and  their
goal  is,  really,  to  get  the  data  out  there  so  that  other  scientists  can 
figure  out  what's  there.  So,  they  are  pouring  out  data  every  day  of 
this  sort  for  the  rest  of  the  scientific  community  to  use,  to  analyze, 
to  try  to  figure  out.  Is  there  a  cancer  gene  in  yesterday's  output 
from  the  St.  Louis  center?  Is  there  a  diabetes  gene  in  the  day-be- 
fore-yesterday's  output  from  Maynard  Olson's  Center  at  the  Uni- 
versity of  Washington?  It  takes  another  set  of  steps  to  figure  that 
out. 

The  sequence  itself  is  publicly  accessible.  It  is  truly  in  the  public
domain.  "Public  domain"  is  usually  reserved  to  say  there  has  been 
no  intellectual  property  placed  upon  this,  so  the  sequence  is  both 
publicly  accessible  and  it  is  in  the  public  domain.  Now  future  inves- 
tigators, who  figure  out  the  value  of  a  particular  gene  sequence, 
may  learn  that  it  causes  a  particular  disease  or  learn  that  it  can 
be  turned  into  a  pharmaceutical,  and  then  may  decide  that  they 
have  added  enough  value  to  that  to  meet  the  criteria  of  novelty, 
nonobviousness,  and  utility  and  file  a  patent  on  it.  Those  investiga- 
tors might  be  in  academia  or  they  might  be  in  companies,  and  the 
Patent  and  Trademark  Office  then  decides  whether  they've  made 
a  convincing  case  or  not. 

Mr.  ROEMER.  Thank  you.  I  think  each  time  you  ask  a  question, 
it  begs  some  more  questions.  It's  been  a  fascinating  panel  and 
you've  been  very  helpful  and  I  hope  we  can  do  another  panel  like 
this  and  add  to  some  more  questions.  And  I  appreciate  the  Chair- 
man, your  foresight  in  having  this  hearing  today. 

Chairman  Calvert.  Thank  you,  Mr.  Roemer. 

CONSEQUENCES  OF  PRIVATE-SECTOR  VENTURE  FOR  FEDERAL  HUMAN 

GENOME  PROJECT 

Chairman  Calvert.  I  have  just  a  quick  question  for  Dr.  Olson. 
Obviously  you  are  a  skeptic  when  it  comes  to  the  private  sector  ini- 
tiative described  here  today.  If  this  project  is  likely  to  fail,  in  your 
estimation,  should  we  just  ignore  it  and  continue  the  federal  pro- 
gram that  we  have  today  unchanged? 

Mr.  Olson.  Well,  I  want  to  make  clear  that  failure  is  a  relative 
term.  I  have  emphasized  that  I  believe  it  will  produce  a  huge 
amount  of  extremely  useful  data.  I  don't  believe  that  it  will  meet 
the  quality  standards  which  have  been  outlined.  And  I  think  that 
the  federal  program  would  be  well  advised  over  the  next  2  or  3 
years  to  concentrate  on  defining  the  cost-benefit  tradeoffs  associ- 
ated with  the  high-quality  sequence  product.  No  known  approach 
is  going  to  produce  a  perfect  product.  Indeed,  perfect  is  not  well- 
defined  in  the  context  of  intrinsically  variable  structure  like  the 
human  genome,  but  I  believe  that  the  federal,  the  unique  niche  for 
the  federal  program  over  the  next  few  years  is  to  refine  the  meth- 
ods that  are  required  to  produce  the  best  available  product  that  can 
be  achieved  at  a  reasonable  cost,  and  I  would  define  a  reasonable 
cost  as  roughly  current  levels  of  funding. 

One  of  the  difficulties  in  this  highly-collaborative  model,  which  is 
certainly  correct  in  principle,  but  a  technical  point  about  the  pro- 
posed Perkin-Elmer  strategy  is  that  it  is  heavily  back  loaded  in 
terms  of  answering  my  concerns.  Even  a  simple  theoretical  analysis 
of  this  approach  to  sequencing  the  genome,  indicates  that  particu- 
larly this  issue  of  gaps,  will  only  be  addressable  relatively  late  in 
the  project.  One  simply  can't  tell  from  the  early  indicators  how  that 
issue  is  going  to  go. 

So,  I  believe  the  federal  project  should  focus  on  the  high  quality 
and  the  definition  of  high  quality,  the  exploration  of  the  cost/bene- 
fit issues,  the  demonstration  that  by  fail-safe  methods  we  can 
produce  such  data  over  the  next  few  years  and  when  this  rather 
back  loaded  information  comes  to  us  from  this  initiative  or  other 
initiatives,  all  I  can  really  say  is  that  we  will  look  at  it  very  closely 
and  I'm  certainly  pleased  to  hear  these  renewed  strong  assurances
that  we'll  be  able  to  look  at  it.  That  is  the  data  that  will  be  there.

Chairman  Calvert.  Thank  you,  Doctor.

EFFICIENCY  OF  FEDERAL  HUMAN  GENOME  PROGRAM 

Chairman  Calvert.  Dr.  Galas,  you've  got  some  experience  in 
government,  now  in  the  private  sector.  How  would  you  evaluate  the 
efficiency  of  the  government  program  and  their  ability  to  make 
changes  as  technology  improves? 

Mr.  Galas.  I,  actually  I  think  that  the  human  genome  program, 
perhaps  because  of  the  fact  that,  unlike  most  federally-supported 
programs  there's  internal  competition  of  a  friendly  type  within  the 
program  having  two  agencies  running  it  actually  has  been  very  responsive
in  being  able  to  take  advantage  of  new  technologies.  With
the  DOE  and  the  NIH  looking  over  each  other's  shoulders,  I  think 
actually  the  human  genome  program  has  done  reasonably  well  in 
that  regard.  I'm  sure  it  could  be  improved  and  I'm  sure  they  are 
constantly  looking  at  how  to  do  so,  but  I  think  they  can  take  ad- 
vantage of  that. 

I  would  say  that,  if  I  might  address  some  of  the  comments  that 
Dr.  Olson  just  made,  I  think  that  in  fact  there  probably  does  exist 
a  strategy  that  would  be  a  different  strategy  from  what  is  being
done  right  now  in  the  program.  Maybe  only  slightly  different,  but  different
nonetheless,  that  does  not,  on  the  one  hand,  depend  entirely
on  the  success  or  the  back  loaded  success  of  the  private-sector  pro- 
gram but  can  take  advantage  of  data  as  it's  released  from  this  pro- 
gram and  enhance  the  federal  effort,  but  not  depend  on  the  success 
of  the  private  program,  but  merely  be  accelerated  by  it  if  it  does 
succeed.  And  I  think  that's  what  the  federal  program  should  focus 
on,  rather  than  focusing  on  the  downstream,  final  product  which  I 
think,  quite  frankly,  that  Dr.  Olson  makes  when  he  talks  about  se- 
quence quality  on  the  one  hand  and  scientific  standards  on  the 
other,  they  are  not  equivalent  at  all.  Those  are  really  not,  that's  an 
inequality  that  can't  be  made  I  think. 

I  think  there's  a  rational  strategy  in  there  which  does  have  a 
continually  improving  quality  of  sequence,  or  a  staged  quality  of  se- 
quence that  would  get  some  of  the  fundamental,  really  important 
biological  data  out  sooner  and  benefit  us,  be  able  to  take  advantage 
of  what  data  is  released  by  the  private  sector  without  making  any 
assumptions  about  either  the  quality  or  whether  or  not  they'll  suc- 
ceed. 

Chairman  Calvert.  Thank  you. 

Mrs.  Lee,  no  questions?  Any  other  questions  from  the  panel?

I  want  to  thank  our  witnesses  for  very  interesting  testimony  and
answers  to  our  questions.  I  think  you  can  rest  assured,  I  doubt 
very  much  if  Congress  will  cut  funding  on  the  Human  Genome 
Project  and  we  look  forward  to  a  successful  conclusion  and  certainly,
Doctor,  we  wish  you  well  in  your  new  venture.  Thank  you.

[Whereupon,  at  2:40  p.m.,  the  hearing  was  adjourned.] 

[The  following  material  was  received  for  the  record.] 

APPENDIX  1:  Answers  to  Post-Hearing  Questions  Submitted  by  Members  of
the  Subcommittee  on  Energy  and  Environment 

U.S.  Department  of  Energy

Washington,  DC 

Post-Hearing  Questions  Submitted  by  Chairman  Calvert 

Scientific  Justification  for  Completing  Government-Funded  Sequencing  of  Entire  Human 
Genome 

Q1.  Critics  of  the  government  program  say  that  sequencing  the  entire  human  genome  is
a  waste  of  the  taxpayer's  money.  Please  explain  why  it  is  scientifically  necessary  to 
complete  the  entire  process. 

A1.  We  estimate  that  the  human  genome,  approximately  3  billion  bases  of  DNA,  contains
about  80,000  genes.  It  has  been  estimated  that  the  DNA  sequence  (cDNAs)  containing
the  specific  instructions  for  making  these  80,000  protein  products  may  occupy  only  about
3%  of  the  total  genome.  While  the  specific  role  for  the  remaining  97%  of  the  genomic
sequence  is  unknown  at  this  time,  there  is  no  way  at  present  to  reliably  recognize  in
advance  those  components  that  we  need  to  sequence.  Even  if  we  could  physically
recognize  the  important  sequences,  there  is  no  method  to  select  out,  in  an  economical  way,
those  parts  that  are  biologically  significant  for  sequencing.  Merely  sequencing  the
expressed  cDNAs  certainly  won't  deliver  the  needed  information  to  understand  human
biology;  on  this  there  is  very  strong  agreement  from  the  research  community.  For
example,  essentially  all  of  the  information  that  is  critical  for  the  proper  regulation  of  genes,
information  vital  to  the  proper  "turning  on"  and  "turning  off"  of  genes  so  that  they
become  operational  at  the  right  times  and  in  the  right  cells,  is  not  recovered  in  the
expressed  cDNAs.  Damage  in  these  regulatory  regions  has  been  shown  to  be  an
important  cause  of  genetic  disease  in  humans.

We  can  and  must  do  the  best  job  we  can  to  prioritize  what  we  sequence  so  that,  in  our
estimation,  we  are  getting  the  best  value  for  the  money.  However,  we  need  to  know  the 
entire  sequence  to  fully  explore  the  complexity  of  human  biology  and  fully  exploit  the 
information  in  the  human  genome. 

Efficiencies  of  DOE's  Joint  Genome  Initiative  vs.  Three  Different  DOE  Laboratory
Programs 

Q2.  In  your  testimony,  you  describe  the  Joint  Genome  Initiative,  which  allows  for  joint 
management  and  oversight  of  three  different  laboratory  programs,  those  at 
Lawrence  Berkeley,  Lawrence  Livermore  and  Los  Alamos.  The  JGI  was 
implemented  seven  years  into  the  program.  Were  there  inefficiencies  and  higher 
costs  as  a  result  of  separate  management  of  the  three  labs'  programs  and,  in 
hindsight,  would  it  have  been  better  if  joint  management  existed  from  the  beginning 
of  the  program? 

A2.  The  first  phase  of  the  Human  Genome  Program  (HGP),  closely  coordinated  between  the
DOE  and  the  NIH,  was  the  phase  of  exploration  requiring  many  independent  pursuits.
Also,  it  was  necessarily  devoted  to  laying  the  groundwork  for  the  intensive  sequencing
effort  that  has  begun  in  the  last  couple  of  years.  In  1990,  at  the  start  of  the  HGP,
sequencing  technologies  were  not  advanced  enough,  nor  efficient  enough,  to  accomplish 
the  task  of  sequencing  3  billion  base  pairs  at  the  expected  funding  levels  and  in  the 
expected  time  frame.  Additionally,  large  scale  chromosomal  mapping  efforts  were 
undertaken  to  provide  the  detailed  physical  maps  that  it  was  thought  would  be  critically 
necessary  to  achieving  the  complete  genome  sequence.  Each  of  the  three  DOE  Lab 
genome  centers  carried  out  parallel  and  non-overlapping  research  efforts  to  map  different
chromosomes  and  to  explore  technologies  that  would  accelerate  the  sequencing.  Not  until 
the  genome  project  was  ready  to  switch  directions  to  full  scale  production  sequencing, 
was  the  nature  of  the  task  such  that  issues  of  critical  mass,  economies  of  scale,  and 
sharpness  of  focus  together  made  central  management  the  correct  paradigm. 

Post-Hearing  Questions  Submitted  by  Democratic  Members

Difference  Between  the  DOE-NIH  and  "Shotgun"  Human  DNA  Sequencing  Approaches 

Q1.  How  does  the  DOE-NIH  approach,  projected  to  be  completed  by  the  year  2005,
differ  from  the  Venter-Perkin-Elmer  plan  to  use  the  "shotgun"  method  to  sequence 
the  human  genome  in  three  years? 

A1.  The  DOE/NIH  commitment  is  to  produce  a  complete  and  accurate  image  of  the  human
genome  by  2005.  In  the  first  2  years  (FY  1997  and  FY  1998)  of  the  production  effort,  the
approach  taken  insisted  on  full  sequencing  accuracy,  high  continuity,  and  detailed  mapping
(location)  knowledge  every  step  of  the  way,  in  part  to  ensure  that  these  meritorious
standards  could  be  achieved  at  affordable  cost.  This  assurance  now  being  in  hand,  DOE  is
considering  an  approach  in  which  we  produce  an  intermediate  draft  version  of  the  genome
based  on  a  "mapped  clone  shotgun  method,"  in  contrast  to  the  "whole  genome  shotgun
method"  being  followed  by  Venter-Perkin-Elmer.  In  the  mapped  clone  shotgun,  in  which
we  shotgun  sequence,  but  only  within  already  mapped  clones  that  are  about  1/20,000  the
size  of  the  genome,  we  can  have  a  much  higher  assurance  of  positional  and  sequence
assembly  validity  than  with  the  Venter-Perkin-Elmer  method.  In  practice,  the  two  approaches
will  complement  each  other  and  be  extremely  useful  to  the  scientific  community.
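
The scale implied by these figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the approximate 3 billion base genome size and the 1/20,000 clone-to-genome ratio quoted in these answers, and the function names are hypothetical rather than part of any DOE or NIH software.

```python
# Illustrative arithmetic only; both figures are approximations quoted in the answers.
GENOME_BASES = 3_000_000_000   # approximate size of the human genome, in bases
CLONES_PER_GENOME = 20_000     # a mapped clone is about 1/20,000 the size of the genome


def clone_size_bases(genome_bases: int = GENOME_BASES,
                     clones_per_genome: int = CLONES_PER_GENOME) -> int:
    """Approximate length, in bases, of one mapped clone."""
    return genome_bases // clones_per_genome


def min_clones_to_tile(genome_bases: int = GENOME_BASES,
                       clones_per_genome: int = CLONES_PER_GENOME) -> int:
    """Fewest clones that could cover the genome once if none overlapped.

    Real clone sets overlap, so this is a lower bound (an assumption of the sketch).
    """
    size = clone_size_bases(genome_bases, clones_per_genome)
    return -(-genome_bases // size)  # ceiling division


if __name__ == "__main__":
    print(clone_size_bases())    # 150000
    print(min_clones_to_tile())  # 20000
```

On these assumptions, each mapped-clone assembly problem involves roughly 150,000 bases rather than 3 billion, which is consistent with the higher assurance of assembly validity described above.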

Role  of  DOE  and  NIH  in  Collaboration  with  Private-Sector  Venture

Q2.  Do  you  see  a  role  for  DOE  and  NIH  to  collaborate  with  Venter  and  Perkin-Elmer  to
complete  sequencing  of  the  human  genome? 

A2.  Yes,  a  very  significant  opportunity  exists.  In  practical  and  scientific  terms,  the  two 
approaches  can  strongly  and  synergistically  complement  each  other.  In  fact,  the  clone 
resources  that  Venter-Perkin-Elmer  will  utilize  have  been  developed  and  made  available  to 
the  public  by  DOE  and  NIH;  and  the  DOE  is  funding  projects  that  will  provide  the
sequence  information  from  the  ends  of  600,000  BACs  (bacterial  artificial  chromosomes)
that  will  form  the  scaffold  needed  for  linking  the  human  genome  sequence  together  in  the 
Venter-Perkin-Elmer  Plan.  The  DOE  and  NIH  will  help  both  private  and  the  public 
sequencing  efforts  by  aggressively  completing  the  BAC-end  sequence  set,  as  well  as 
developing  a  high  resolution  radiation  hybrid  map  of  BAC  ends  and  other  sequence 
markers,  and  the  mapping  of  all  cDNA  ESTs  (Expressed  Sequence  Tags)  against  the  BAC 
libraries  being  sequenced. 

Q2.1.    How  would  this  be  done? 

A2.1.  On  the  Venter-Perkin-Elmer  side,  prompt  and  complete  sharing  of  their  raw  data
with  the  public  is  the  core  requisite  of  making  the  two  efforts  mutually 
complementary.  On  the  public  side,  it  is  necessary  that  DOE  and  NIH 
simultaneously  produce  a  high  quality,  fully  mapped,  draft  ('scaffold')  intermediate 
version  of  the  genome,  on  top  of  which  the  Venter-Perkin-Elmer  sequence  could 
most  usefully  be  assembled  (adding  depth  for  improved  accuracy  and  coverage). 
The  public  effort  would  then  proceed  to  complete  this  jointly  constructed  draft 
version  to  full  coverage  and  accuracy  sooner  than  originally  planned  and  at  a  lower
cost. 

Q2.2.   At  what  stage  would  it  be  done? 

A2.2.  The  Venter-Perkin-Elmer  venture  has  projected  a  completion  date  of  two  to  three 
years;  thus,  to  be  effective,  any  collaborative  elements  need  to  be  in  place  quickly 
and  ongoing  during  the  course  of  the  project.  As  mentioned  above,  some  of  the 
needed  efforts  are  already  underway  and  it  is  anticipated  that  the  remaining 
components  will  be  initiated  before  January  1999. 

Concerns  of  International  Collaborators  About  Intellectual  Property  Rights  and  Patenting

Q3.      The  international  Human  Genome  Organization  (HUGO)  has  been  fairly  vocal 
about  their  feelings  concerning  intellectual  property  rights  and  patenting. 

Q3.1.  How  have  the  international  collaborators  responded  to  this  proposed 
venture? 

A3.1.  With  very  serious  concern.  These  concerns  derive  from  the  immense  and 
essentially  unrestrained  possibilities  that  exist  for  intellectual  property  rights 
control  when  extremely  high  rate,  highly  automated  data  generation  techniques  are 
used  by  a  privately  owned  company  to  produce  and  combine  both  "composition  of 
matter"  information  (sequence  data)  with  "utility"  information  (e.g.,  mapping  and 
gene  expression  data),  to  form  the  basis  of  patent  applications  en  masse.  Thus,  the 
response  to  this  venture  by  the  Wellcome  Trust  in  Great  Britain,  the  principal 
public  funder  of  human  genome  sequencing  efforts  at  the  Sanger  Center  in  Britain, 
was  to  announce  that  they  would  double  the  budget  in  support  of  human  genome 
sequencing  at  the  Sanger  Center.  The  Sanger  Center,  like  its  US  counterparts,  has
a  policy  of  daily  release  of  sequence. 

Q3.2.  How  do  you  plan  to  allay  their  concerns  that  the  race  for  patenting  will  (1) 
hinder  information  exchange  and  (2)  result  in  unnecessary  and  costly 
duplication? 

A3.2.  (1)  The  DOE  and  NIH  must  not  deviate  from  their  clearly  stated  policy,  elaborated
at  a  series  of  meetings  of  the  heads  of  sequencing  programs  and  large  sequencing 
labs  in  the  US  and  other  countries,  of  nightly  electronic  release  of  newly 
determined  human  sequence,  without  any  restrictions  on  availability. 

The  Venter-Perkin-Elmer  group  has  publicly  stated  that  the  vast  majority  of 
sequence  information  that  they  determine  will  be  deposited  in  public  databases 
within  a  few  months  of  sequencing.  The  several  hundred  genes  that  they  say  they 
will  focus  on  represent  much  less  than  one  percent  of  all  human  genes.  Thus
information  exchange  for  the  vast  majority  of  human  genes  should  not,
theoretically,  be  compromised  by  this  private  sequencing  effort.  Similarly,  there
should  not  be  a  costly  race  for  patenting  for  >99%  of  the  human  genes  simply  as  a 
result  of  this  one  private  effort.    It  should  not  be  surprising,  however,  that  "use 
patents"  for  human  genes  may  become  a  significant  issue  when  large  numbers  of
human  genes  are  finally  identified,  whether  by  private  or  public  methods. 

(2)  As  mentioned  earlier,  the  Venter-Perkin-Elmer  genome  sequencing  efforts  are 
seen  by  DOE  as  complementary  and  not  duplicative  of  the  public  efforts  by  the  US 
public  Human  Genome  Program.  With  regard  to  the  public  efforts,  the  Human 
Genome  Organization  (HUGO)  is  coordinating,  through  a  Web  site,  a  current  view 
of  which  centers/labs  are  sequencing  which  human  chromosomes  or  chromosome 
fragments.  This  site  is  accessible  to  anyone  via  the  Web.  The  purpose  of  this 
HUGO  effort  is  to  minimize  duplication  among  publicly  funded  sequencing  efforts. 

Post-Hearing  Questions  Submitted  from  Chairman  Ken  Calvert

Scientific  Justification  for  Completing  Government-Funded  Sequencing  of  Entire  Human 
Genome 

Q1.  Critics  of  the  government  program  say  that  sequencing  the  entire  human  genome  is
a  waste  of  the  taxpayer's  money.  Please  explain  why  it  is  scientifically  necessary  to 
complete  the  entire  process. 

A1.  The  more  we  study  DNA,  the  more  we  understand  how  it  carries  out  its  amazing  work.
Genes  affect  almost  all  important  biological  processes,  at  least  in  part.  This  includes
those  processes  that  lead  to  or  are  involved  in  disease.  By  identifying  the  gene(s)
associated  with  a  disease,  we  will  gain  important  understanding  that  can  help  us  develop
therapies  or  preventive  strategies.  The  Human  Genome  Project,  including  sequencing  the
entire  human  genome,  is  designed  to  speed  up  the  process  of  gene  identification  and  make
it  much  more  cost-efficient.  Genes,  we  have  learned,  are  made  up  of  several  parts  that
control  their  activity.  Sometimes  all  the  parts  are  clustered  in  the  same  DNA
neighborhood,  but  other  times,  the  parts  may  be  scattered  far  apart  from  each  other.  Also,
at  times  mistakes  in  DNA  spelling  in  regions  thought  to  be  of  no  importance  turn  out  to
contribute  to  disease  risk.  We  already  have  found  such  examples  for  cancer,  diabetes,  and
osteoporosis.  Some  important  parts  are  very  easy  to  spot  and  some  aren't.  Knowing  all  of
the  parts  of  a  gene  is  critical  to  understanding  how  it  works.  Many  of  the  other
approaches  to  gene  identification  that  have  been  used  so  far  cannot  find  all  of  the  parts  of
every  gene  (that  is  one  reason  why  these  other  approaches  tend  to  be  somewhat  faster  and
appear  to  be  less  expensive).  Having  a  complete  genome  sequence  is  the  only  way  to  find
all  of  the  parts  of  all  of  the  genes  that  may  affect  human  health.  The  Human  Genome
Project  will  provide  a  truly  complete  genome  sequence  containing  no  gaps.  That  level  of
completeness  we  believe  is  necessary  to  provide  researchers  with  the  best  possible  tool  for
understanding  the  function  of  genes  and  their  role  in  human  health  and  disease.

Post-Hearing  Questions  Submitted  by  Democratic  Members

Difference  Between  the  DOE-NIH  and  "Shotgun"  Human  DNA  Sequencing  Approaches 

Q1. How does the DOE-NIH approach, projected to be completed by the year 2005,
differ  from  the  Venter-Perkin  Elmer  plan  to  use  the  "shotgun"  method  to  sequence 
the  human  genome  in  three  years? 

A1. Sequencing was once done by hand as a series of chemical reactions, a slow and costly
method. Now, machines can read the sequence quickly, but current instruments can only
read  short  DNA  fragments  at  a  time.  So,  using  a  strategy  referred  to  as  "shotgun" 
sequencing,  an  investigator  randomly  cuts  DNA  into  small  fragments.  These  fragments 
are  small  enough  for  sequencing  machines  to  read.  Then,  the  scientist  must  correctly 
reassemble  all  of  these  sequenced  fragments  in  order  to  properly  reconstruct  the  full- 
length  DNA  sequence.  The  reassembly  of  this  giant  puzzle  is  carried  out  largely  by  highly 
skilled  scientists  using  sophisticated  computer  programs. 
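
As an illustration only, the shotgun procedure described above can be sketched in a few lines of Python. This is a toy model and not the software used by any sequencing center: the function names (shotgun_fragments, overlap, greedy_assemble), the greedy merging rule, and the minimum overlap of three letters are all assumptions made for this example.

```python
import random


def shotgun_fragments(genome, read_len, n_reads, seed=0):
    """Randomly sample fixed-length fragments ("reads") from the genome."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return {genome[s:s + read_len] for s in starts}


def overlap(a, b, min_len):
    """Longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    if b in a:
        return len(b)  # b adds nothing new; merging simply keeps a
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two pieces that share the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps remain: the pieces left over are separated by gaps
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads


# Three overlapping fragments of a short, repeat-free sequence.
print(greedy_assemble(["TGCAAGCT", "AGCTTAGG", "ACGTTGCA"]))
```

With all three fragments present, the sketch returns a single reconstructed string. If the middle fragment is left out, it returns two separate pieces, which corresponds to a gap in the sequence. Repeated stretches of DNA, which this toy example deliberately avoids, are what make reassembly hard at genome scale.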

The  sequencing  strategy  the  public  genome  project  uses  employs  shotgun  sequencing  of 
DNA fragments that have been carefully mapped and catalogued. This strategy is designed
to  maximize  the  accuracy  of  reassembling  the  sequenced  fragments,  because  the  scientist 
knows  where  the  fragments  belong.  Even  so,  the  scientists  periodically  encounter  DNA 
regions  that  are  particularly  difficult  to  sequence,  and  which  therefore  require  special 
attention.  Because  all  the  fragments  have  been  catalogued,  a  scientist  can  return  to  these 
difficult  spots  after  most  of  the  genome  has  been  sequenced  and  assembled  to  work  on 
closing  the  gaps  and  strengthening  the  weak  areas  so  that  the  entire  sequence  will,  in  the 
end,  be  finished  to  very  high  quality.  The  international  sequencing  community,  whose  goal 
is  to  complete  the  human  DNA  sequence  by  2005,  has  agreed  to  a  policy  of  releasing 
completed  sequence  every  24  hours  into  a  free,  publicly-accessible  database.  More  than  10 
percent  of  the  human  sequence  is  now  available  in  a  public  database,  and  about  half  of  that 
is  already  "finished." 

The sequencing strategy proposed by scientists at Perkin-Elmer, Inc. and Dr. Venter also
employs  shotgun  sequencing,  but  differs  from  the  public  effort  in  several  significant  ways. 
First,  that  strategy,  called  "whole-genome  shotgun  sequencing",  employs  fragments  that 
have not been previously mapped or catalogued. Because the scientist does not know
where  in  the  morass  of  3  billion  base  pairs  the  fragment  might  belong,  the  task  of 
reassembling the fragments becomes far more difficult. Many believe this difficulty in
reassembly  will  inevitably  lead  to  many  gaps  and  misassembled  regions  in  the  sequence. 
These  scientists  believe  that,  on  its  own,  the  quality  of  the  "whole  genome  shotgun 
sequence" will not be as high as that planned for the publicly-funded sequence. For
example,  when  a  scientist  encounters  a  fragment  that  is  particularly  difficult  to  sequence, 
he  or  she  will  not  be  able  to  return  to  the  fragment  later  because  it  has  not  been 
catalogued. The Perkin-Elmer-Venter approach does not propose to fill in all the gaps left
by  these  unsequenced  fragments,  thereby  creating  a  product  that  may  be  incomplete  for 
many  research  uses.  Not  having  a  sequence  of  the  highest  quality  will  be  a  serious  problem 
when  the  gaps  and  errors  occur  in  DNA  regions  with  biological  significance. 

In addition, release of sequence data from the Perkin-Elmer-Venter effort will occur
quarterly, rather than daily. Although the company states that sequence will be made
public, release will be significantly slower than data release from the publicly-funded
effort. As a result, the larger research community's access to this valuable data will be
slowed  down.  Furthermore,  the  new  company  maintains  the  right  to  patent  the  most 
biologically  important  gene  data. 

Role  of  DOE  and  NIH  in  Collaboration  with  Private-Sector  Venture 

Q2. Do you see a role for DOE and NIH to collaborate with Venter and Perkin-Elmer to
complete  sequencing  of  the  human  genome? 

Q2.1     How  would  this  be  done? 
Q2.2    At  what  stage  would  it  be  done? 

A2. Partnership with the private sector is both necessary and desirable, and we welcome this
new initiative by Perkin-Elmer and Dr. Venter. In the year ahead, we will look carefully at
the ways in which this private initiative and the publicly-funded effort can be
complementary. If need be, the federal effort is fully prepared to adjust its strategy. In
fact,  in  late  May,  just  weeks  after  the  private  sector  announcement,  there  was  a  meeting 
involving  more  than  100  scientists  from  various  fields  and  from  both  the  public  and  private 
sectors,  to  look  at  the  next  five  years  of  the  genome  project.  The  subject  of  how 
collaboration  might  occur  and  whether  or  not  the  publicly-funded  effort  should  revise  its 
strategy  was  intensely  discussed.  I  think  it  is  fair  to  say  there  is  not  yet  complete 
unanimity  on  the  answer  to  those  questions.  The  Perkin-Elmer/Venter  proposal  is  a 
scientific  experiment;  we  like  that.  Scientists  are  energized  by  the  opportunity  to  see  a 
new approach tried out. It will take time, at least 12 to 18 months, to develop enough
data  to  allow  the  usefulness  of  the  approach  to  be  evaluated,  and  to  assess  the  quality  of 
the product, but that is what science is all about.

Concerns  of  International  Collaborators  About  Intellectual  Property  Rights  and  Patenting 

Q3.  The  international  Human  Genome  Organization  (HUGO)  has  been  fairly  vocal 
about their feelings concerning intellectual property rights and patenting.

Q3.1 How have the international collaborators responded to this proposed venture?
Q3.2 How do you plan to allay their concerns that the race for patenting will (1)
hinder information exchange and (2) result in unnecessary and costly duplication?

A3. On May 13, 1998, the Wellcome Trust announced its intent to increase its support of
British science in the sequencing of the human genome. Previously, the Wellcome Trust
had committed to funding the sequencing of one sixth of the human genome at the Sanger
Centre in the United Kingdom. The May 13 announcement doubled that commitment to
one  third  of  the  genome  and  expressed  concern  with  regard  to  a  number  of  aspects  of  the 
private  sector  initiative.  In  the  press  release  accompanying  the  announcement,  the 
Wellcome  Trust  stated: 

"The Wellcome Trust has today announced a major increase in its flagship
investment  in  British  science  in  the  sequencing  of  the  human  genome... 
The  Trust  is  concerned  that  commercial  entities  might  file  opportunistic 
patents  on  DNA  sequence.  The  Trust  is  conducting  an  urgent  review  of 
the  credibility  and  scope  of  patents  based  solely  on  DNA  sequence... 
This  week  a  commercial  venture  announced  its  intention  to  produce 
partial  sequence  of  the  human  genome,  to  delay  release  of  this 
information  and  to  have  exclusive  rights  to  patent  some  of  these 
sequences... The Wellcome Trust believes that the human genome
should  be  sequenced,  through  an  international  collaboration,  as  speedily 
and  accurately  as  possible,  with  the  results  being  placed  immediately  in 
the  public  domain." 

The Wellcome Trust is the leading European funder of human genome sequencing. Its
early  support  of  work  in  the  field  has  enabled  Dr.  John  Sulston,  Director  of  the  Sanger 
Centre,  and  his  colleagues,  to  generate  one  third  of  all  the  human  sequence  which  had 
been produced at the time of the May 13 announcement.

With regard to patenting, this is a difficult area that does not lend itself to simple answers.
The  way  the  publicly-funded  effort  in  the  United  States,  which  includes  HGP  grantees 
from  universities  all  over  the  country  and  also  at  the  DOE  labs,  is  going  forward  is  that  we 
have  agreed  with  our  international  sequencing  collaborators  to  deposit  sequence  data 
within  24  hours  of  the  time  it  reaches  at  least  an  assembly  of  2,000  bases,  or  letters,  in  a 
row.  Absent  a  finding  of  exceptional  circumstances,  we  are  not  at  the  NIH  allowed  to 
deny  our  grantees  the  opportunity  to  file  for  intellectual  property  rights  on  things  they 
discover  with  NIH  funds,  because  of  the  Bayh-Dole  Act.  As  a  practical  matter,  however, 
the publicly supported sequencing community has agreed to a 24 hour data release policy,
and  we  are  not  aware  that  there  have  been  any  patent  filings. 

Therefore, the sequence itself is publicly accessible. It is truly in the public domain, a term
usually reserved for data on which no intellectual property restrictions have been placed.
So, future investigators, who figure out the function of a particular gene
sequence  and/or  turn  that  sequence  information  into  a  pharmaceutical  or  a  new  diagnostic, 
may  decide  they  have  added  enough  value  to  meet  the  patent  criteria  of  novelty, 
nonobviousness,  and  utility,  and  file  for  a  patent.  Those  investigators  may  be  in  academia, 
here  in  the  United  States  or  abroad,  or  they  might  be  in  private  industry.  But  all  seeking 
patent  protection  must  make  a  case  sufficient  to  convince  the  Patent  and  Trademark 
Office  that  their  discovery  deserves  protection  under  the  law. 

Federal Government's Cost to Completely Sequence the Human Genome

Q4.  Dr.  Collins,  you  have  indicated  that  to  date,  the  Federal  Government  has  spent 
about  $100  million  on  human  genome  sequencing.  How  much  more  do  you  think  it 
will  cost  the  Federal  Government  to  completely  sequence  the  human  genome  using 
the  federal  sequencing  approach? 

A4.  The  original  projection  was  that  the  entire  Human  Genome  Project,  including  mapping, 
sequencing,  technology  development,  model  organisms,  informatics,  and  ELSI  would  cost 
$200  million  a  year  for  15  years,  for  a  total  of  $3  billion  in  1990  equivalent  dollars.  If  you 
include the FY'99 budget request, a total of $1.5 billion in 1990 dollars will have been
spent  over  a  9  year  period.  This  is  approximately  $300  million  below  the  $1.8  billion 
originally  projected  for  the  Project  over  the  first  9  years.  So  we  are  significantly  under  the 
projected cost of the Project.
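
The budget figures in this answer can be checked with a short calculation. The sketch below uses only the numbers stated in the answer (1990-equivalent dollars); the variable names are chosen for illustration.

```python
# Figures taken from the answer above, in 1990-equivalent dollars.
ANNUAL_PROJECTION = 200_000_000  # projected cost per year
PROJECT_YEARS = 15               # planned length of the Project
YEARS_ELAPSED = 9                # through the FY'99 budget request
BELOW_PROJECTION = 300_000_000   # stated amount under the projection

total_projected = ANNUAL_PROJECTION * PROJECT_YEARS       # whole Project
projected_to_date = ANNUAL_PROJECTION * YEARS_ELAPSED     # first 9 years
spent_to_date = projected_to_date - BELOW_PROJECTION      # implied spending

print(total_projected, projected_to_date, spent_to_date)
```

The projection works out to $3 billion for the whole Project and $1.8 billion for the first 9 years, so being $300 million under projection implies about $1.5 billion spent.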

Up  to  this  point,  the  Project  has  only  spent  about  $100  million  on  human  production 
sequencing. Now it is a very critical question, what will it cost the government to
completely  sequence  the  human  genome?  The  difference  between  50  cents  per  finished 
base and 49 cents per finished base is $30 million worth of cost. Greater reductions in the
per  finished  base  cost  will  yield  more  significant  reductions  in  cost. 
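
The $30 million figure follows from the size of the genome. The sketch below assumes a genome of roughly 3 billion bases, the figure quoted elsewhere in these answers; the function name sequencing_cost is hypothetical and the per-base prices are the two mentioned above.

```python
GENOME_BASES = 3_000_000_000  # assumed size: roughly 3 billion base pairs


def sequencing_cost(cents_per_base, bases=GENOME_BASES):
    """Total cost in dollars of finishing every base at a given price per base."""
    return cents_per_base * bases // 100


saving = sequencing_cost(50) - sequencing_cost(49)
print(saving)
```

Under that assumption, each cent saved per finished base is worth $30 million over the whole genome.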

The  NHGRI  has  instituted  a  new  method  of  bringing  together  our  genome  sequencing 
centers. They have agreed to cooperate to share their technology ideas and to figure out
who  is  saving  money  and  at  what  step  or  steps  in  the  process.  The  NHGRI  also  will 
continue  to  support  research  to  improve  sequencing  technology  and  reduce  costs. 

I  think  it  is  a  little  hard  to  predict  how  things  will  go  in  the  next  6  or  7  years,  particularly 
with regard to the impact on costs of further developments in technology and activity in
the  private  sector.  But  I  am  very  optimistic  that  the  sequencing  component  of  the  Project 
can be accomplished within the projected budget. To date, we have met our goals on
time, and under budget. I would hope the Human Genome Project in the future will be
judged by the total budget that was required to provide a highly accurate, publicly
accessible, contiguous, finished sequence as soon as possible.


COMMITTEE  ON  SCIENCE 

SUBCOMMITTEE  ON  ENERGY  AND  ENVIRONMENT 

U.S.  HOUSE  OF  REPRESENTATIVES 

Hearing 
on 

The  Human  Genome  Project: 
How  Private  Sector  Developments  Affect  the  Government  Program 

June  17,  1998 

Post-Hearing  Questions  Submitted  to 

Dr.  J.  Craig  Venter 

President  and  Director 

The  Institute  for  Genomic  Research 

Rockville,  MD 

Post-Hearing  Questions  Submitted  by  Republican  Members 

Will  the  Private  Initiative  Duplicate  the  Federal  Human  Genome  Project? 

Q1. Please tell us, should your initiative be successful, will you in fact have
duplicated  the  federal  program,  or,  as  some  have  said,  given  us  a  "synopsis" 
of  the  human  genome? 

A1. By obtaining the complete DNA sequence of the human genome by the year 2000,
our new venture will make the science of genomics directly applicable to
combating human disease in the broadest way possible. We won't duplicate the
federal program because we'll actually obtain the complete sequence and make it
available before that effort is complete. We will, however, be building our
program  on  resources  and  strategies  that  have  been  developed  as  a  result  of  the 
federally-funded  initiative.  As  I  indicated  in  my  testimony,  obtaining  the  complete 
sequence of the human genome is not an end in itself, but represents a beginning
for the real research that will allow us to better understand the disorders that afflict
humankind. The federally-funded program needs to be positioned to ensure this
new  research  takes  place,  whether  in  the  year  2005  as  previously  planned  or  in  the 
year  2000. 

Concern About Release of Data to the Public

Q2.  In  his  testimony,  Dr.  Francis  Collins  expressed  concern  that  your  plans  to 
release data to the public on a quarterly basis is not sufficient. Please tell us
your response to that.

A2.  As  a  requirement  for  receiving  a  grant  from  either  the  Department  of  Energy  or  the 
National Human Genome Research Institute for DNA sequencing, the recipients
are  required  to  release  sequence  data  as  soon  after  it  is  generated  as  possible.  This 
is  a  requirement  for  publicly-funded  activities.  As  I  indicated  in  my  testimony,  we 
don't presume to be able to understand the biological significance of all the data
that  we  will  generate  in  completing  the  sequence  of  the  human  genome.  As 
scientists,  we  also  understand  the  importance  of  sharing  data.  The  current  model 
that  is  employed  by  most  commercial  organizations  in  this  field  is  to  keep  human 
DNA  sequence  data  private.  We  intend  to  share  the  data  that  we  generate  on  a 
quarterly  basis.  There  are  obviously  people  and  organizations,  especially  in  the 
public sector, who don't feel this frequency is adequate. However, we are not
required to meet the objectives of the publicly-funded project, and given the current
commercial alternative, we believe our approach is very appropriate.

Recommendations for Restructuring the Federal Human Genome Project

Q3.  In  your  testimony,  you  say  the  impact  your  venture  will  have  on  the  federal 
program  will  be  to  re-orient  it  to  focus  on  research  into  the  genetic  impact  of 
disease  on  a  broad  basis.  Could  you  please  elaborate  on  that  and  tell  us  any 
specific  recommendations  you  have  on  how  the  federal  program  should  be 
restructured. 

A3.  The  Human  Genome  Project  is  about  much  more  than  just  obtaining  the  complete 
human  DNA  sequence.  The  sequencing  is  just  the  biggest  initial  hurdle  that  needs 
to  be  cleared.  Once  the  human  sequence  is  complete,  the  information  will  exist  to 
begin  in-depth  research  into  the  actual  functioning  of  the  genetic  code.  One 
critical  resource  that  will  be  required  to  undertake  this  task  will  be  providing 
researchers  access  to  full-length  cDNA  clones.  This  will  allow  researchers  to 
study  specific  genes  in  great  detail  and  at  this  time  there  is  no  resource  for  this 
material.  Only  a  small  percentage  of  the  genome  is  actually  made  up  of  genes,  but 
these  regions  will  attract  a  significant  amount  of  the  initial  research  activity  from 
both  private  and  public  entities.  However,  there  will  be  real  value  in  understanding 
all aspects of the human genome, and NHGRI is a logical place to undertake this
activity. 

Post-Hearing Questions Submitted by Democratic Members

Availability  of  Genomic  Information  to  the  Scientific  Community 

Q1. Although details of your business venture with Perkin-Elmer Corporation
may  not  be  finalized,  you  and  Tony  White,  Chair,  President,  and  Chief 
Executive  Officer  of  Perkin-Elmer,  have  indicated  that  you  intend  to  make 
genomic  information  from  this  venture  available  to  the  scientific  community. 
How  can  we  be  assured  that  this  will  happen? 

A1. On June 20, 1997, The Institute for Genomic Research (TIGR) and Human
Genome Sciences (HGS) ended a collaborative arrangement that required TIGR to
forego payments totalling $38 million. The primary reason for my choosing to end
this  relationship  and  access  to  significant  financial  resources  was  a  philosophical 
disagreement  about  the  public  release  of  DNA  sequence  data.  The  day  after  this 
relationship  was  terminated,  TIGR  made  the  largest  deposit  of  DNA  sequence  data 
into  the  public  domain  in  history.  When  I  entered  negotiations  with  the  Perkin- 
Elmer  Corporation  to  undertake  this  new  venture,  the  first  point  of  agreement  was 
the requirement that human genome data would be made publicly available. If
agreement  had  not  been  reached  on  this  point,  we  would  not  be  discussing  this 
new venture. I don't know of many organizations that would forego $38 million
to  ensure  that  DNA  sequence  data  would  be  made  publicly  available,  and  this  act 
should provide a high level of comfort to you and others that this data will be
made  available  to  the  public. 

Timeliness  of  Release  of  and  Compensation  for  Human  DNA  Sequence  Data 

Q2.  Once  obtained,  how  soon  and  for  what  economic  compensation  will  this 
information  be  released  by  your  new  company? 

A2. As previously indicated, the human DNA sequence data will be made publicly
available at no charge on a quarterly basis for the scientific community. The
details  and  pricing  models  for  the  new  venture's  products  are  still  being  determined 
at  this  time. 

Plans to Patent Genomic Sequences

Q3.  Obviously,  you  and  the  Perkin-Elmer  Corporation  plan  to  patent  a  number 
of  genomic  sequences. 

Q3.1  Since  the  patenting  criteria  include  utility,  in  addition  to  novelty  and 
unobviousness  to  peers,  will  the  sequences  you  plan  to  patent 
correspond  to  particular  biological  functions  or  genetic  traits? 

Q3.2  Your  past  patenting  attempts  involved  these  expressed  sequence  tags 
(ESTs)  you  discussed  in  your  testimony.  To  the  best  of  my  knowledge, 
these  requests  were  denied.  Could  you  explain  to  me  (1)  why  that  was 
and  (2)  what  in  your  current  EST  strategy  will  allow  for  the  patenting 
of  these  tags. 

A3. As you correctly noted, the NIH chose to file patents for the ESTs identified by my
lab.  This  initial  application  was  rejected  and  NIH  chose  not  to  appeal  the  ruling. 
We  are  not  planning  to  seek  patents  on  broad  sets  of  ESTs  similar  to  what  was 
done at NIH. Instead, we plan to fully characterize a small subset of key genes for
which we will seek to identify and understand their biological significance. In an
article  published  in  the  May  1,  1998  issue  of  Science,  John  Doll,  Director  of 
Biotechnology  Examination  at  the  U.S.  Patent  and  Trademark  Office  (PTO), 
indicated  that  the  same  patentability  analysis  which  is  conducted  for  any  other 
application will be conducted in the area of genomics. It is our intent to satisfy the
PTO  standards  for  those  discoveries  on  which  we  seek  to  file  for  patents.  I  have 
attached  a  copy  of  that  article  for  your  information. 

Uniqueness  of  Expressed  Sequence  Tags 

Q4. How unique are these tags in terms of their ability to identify an expressed
gene or locate a gene on a larger map of the genome? Is it a 1:1
correspondence in terms of ONE tag corresponding to ONE part of the
genome?  What  does  that  tell  us  about  the  functional  purpose  of  that  gene? 

A4. There is generally a 1:1 correspondence between an EST and its location on the
genome. With regard to functionality, it depends upon what else we know about
the EST as to whether it indicates any specific function. For example, if a human
EST  matches  a  sequence  from  another  organism  and  there  is  some  function 
associated  with  it,  then  it  is  likely  the  sequence  will  have  a  similar  function  in 
humans. 

Role of DOE and NIH in Collaboration with Private-Sector Venture

Q5. What do you see as DOE and NIH's role in collaboration with yourself and
Perkin-Elmer? 

Q5.1 How would this collaboration be done?
Q5.2 At what stage would it be done?

A5.  NIH,  DOE  and  the  new  venture  could  establish  the  basis  for  collaboration  nearly 
immediately,  and  to  some  degree  we  already  have.  As  I  indicated  in  my  testimony, 
certain  resources  that  have  been  publicly-funded  like  bacterial  artificial 
chromosomes (BACs), will provide the framework for assembling the genome data
that we will generate. As we publicly release DNA sequence data, this data will be
available for all DOE and NIH grantees to use in their research.

There  are  more  specific  areas  of  collaboration  that  could  be  undertaken  that  have 
been discussed on a preliminary basis. One area of particular significance that I
have  spoken  about  with  Dr.  Varmus  is  that  of  the  ethical,  legal,  and  social 
implications  of  the  genomic  research.  A  number  of  concerns  have  been  raised  in 
the  past  few  years  about  issues  relating  to  genetic  testing,  discrimination  in 
insurance,  and  privacy  of  individual  genetic  information.  These  issues  and  other 
issues  will  only  become  more  important  in  the  coming  years,  especially  as  we 
speed  up  completion  of  the  sequence  of  the  human  genome.  NIH  has  set  aside  a 
portion of its annual funding to address these issues, and this is an important and
logical area for collaboration. I intend to follow up on my conversation with Dr.
Varmus  to  identify  specific  activities  which  we  can  jointly  undertake. 

Restrictions  on  Researchers'  Ability  to  Obtain  Human  DNA  Sequence  Information 

Q6.  What  restrictions  will  be  placed  on  researchers'  ability  to  obtain  this 
information? 

A6.  The  human  DNA  sequence  information  will  be  made  publicly  available  to 
researchers  on  a  quarterly  basis.  There  will  be  no  restrictions  placed  on  this  data 
by  the  new  venture. 

Relation of New Venture to the Federally-Funded Human Genome Sequencing
Effort 

Q7.  How  will  you  and  Perkin-Elmer  executives  relate  your  program  to  the 
federally  funded  human  genome  sequencing  effort?  To  the  efforts  of  other 
biotechnology  companies? 

A7. The new venture that we are undertaking, if successful, will advance the efforts of
all  human  genome  research  activities.  All  programs  either  publicly  or  privately 
funded  will  gain  some  advantage  by  utilizing  the  information  encoded  in  the  entire 
human  genome.  We  hope  to  work  with  all  researchers  to  improve  understanding 
into  the  genetic  basis  of  disease  and  to  one  day  assist  in  the  creation  of 
therapeutics that will improve human health.

Practical Value of Federal Completion of Entire Human Genome Sequencing
Process 

Q1. In your testimony, you say that, even if the Federal program agrees to the
"first  draft"  approach  you  recommend,  it  should  then  go  on  to  complete  the 
entire  sequencing  process.  Please  tell  us  the  practical  value  this  will  have. 

A1. The "first draft" approach will make available valuable information that can be
used to locate genes and certain other important tasks for projects currently being
pursued  in  the  public  and  private  sectors.  It  is  important,  as  I  testified,  that  this 
information  be  available  as  soon  as  possible  to  help  advance  a  wide  range  of 
present  and  planned  research  work  -  thus  the  value  of  the  "first  draft". 
Researchers will use this information to provide clues to enable them to do further
work,  including  more  detailed  sequencing,  in  specific  places  in  the  genome  of 
direct  interest.  In  no  way,  however,  should  this  "first  draft"  be  viewed  as  the  final 
result  of  the  genome  project.  The  complete  sequence  information  is  needed  in  any 
case to provide a complete picture of the biological function of the genome. When
the final product is available in the databases, any further sequencing by researchers
will not be necessary, and even more time and resources will be saved than with
their use of the "first draft" data.

Post-Hearing Questions Submitted by Democratic Members

Impact  on  Current  Efforts 

Q1. How would your current efforts be affected by the joint venture?

A1. If the joint venture succeeds as planned, we would welcome the new data that will
be available in the databases, and use it as soon as it is available. Our efforts will
thus  be  enhanced  by  the  joint  venture. 

Importance  of  Genomic  Data  That  May  Be  Withheld 

Q2.  How  important  do  you  feel  the  100  to  300  sequences  that  would  be  withheld 
are  to  the  broad  assemblage  of  knowledge? 

A2. Since many companies now withhold the results of their own proprietary work on
genes,  including  their  identity  and  function,  I  doubt  if  this  will  change  the 
landscape  to  a  significant  degree.  I  am  confident  that  any  withheld  genes  will  be 
discovered  in  short  order  in  the  course  of  normal  efforts  by  the  federal  program  or 
by other academic or industry researchers. I would expect that any gene withheld
in  this  way  would  result  only  in  a  short  delay  in  its  availability  to  the  rest  of  the 
community. 

Reasonable  Fees  and  Conditions  to  Private-Controlled  Genetic  Information 

Q3.  Could  you  share  with  the  committee  what  you  feel  are  reasonable  fees  and 
conditions  to  the  genetic  information  Perkin-Elmer  will  control. 

A3.  Unfortunately  it  is  too  early  for  me  to  make  reasonable  estimates  of  this.  It 
depends  on  the  specific  information  (which  is  highly  variable  in  its  value  to  the 
commercial  sector)  and  the  context  of  the  state  of  knowledge  at  the  time  when  it 
would  actually  be  made  available. 

Rights of Individuals: Privacy and Compensation Issues

Q4.  Please  discuss  rights  of  individuals  whose  specific  genomic  sequences  could 
lead to a commercially successful drug. Are there privacy issues? Are there
fair compensation issues?

A4. Use of an individual's DNA should only be done under fully informed consent, which
should include the use of genetic information for research purposes. While there
are strong privacy issues that, in my view, must be dealt with clearly and carefully,
in my opinion, individuals should have no rights to research information that is
gained by using a biological sample as part of a research program. Any future
claims to completely unknowable future results that their sample may be used to
produce  should  be  explicitly  renounced  ahead  of  time  in  the  informed  consent 
process  by  the  individual.  The  advance  of  medical  science  helps  all  of  us  and  our 
future descendants. This is part of the fair compensation for cooperation in a
research  study  of  any  kind,  including  one  that  involves  genetics. 

COMMITTEE ON SCIENCE

SUBCOMMITTEE  ON  ENERGY  AND  ENVIRONMENT 

U.S.  HOUSE  OF  REPRESENTATIVES 

Hearing 
on 

The  Human  Genome  Project: 
How Private Sector Developments Affect the Government Program

June  17, 1998 

Post-Hearing  Questions  Submitted  to 

Dr. Maynard V. Olson

Professor  of  Medical  Genetics  and  Genetics 

Department  of  Molecular  Biotechnology 

and 

Director,  Genome  Center 

University  of  Washington 

Seattle,  WA 

Post-Hearing  Questions  Submitted  by  Democratic  Members 

Concerns  About  Ability  to  Access  Genomic  Information 

Q1. Do you have concerns about your ability to obtain access to genomic
information  that  may  come  out  of  this  new  venture?  If  so,  what  are  they? 
Are  you  aware  of  any  past  or  current  problems  in  this  area? 

A1. I have concerns in two areas. First, current promises about data release cannot be
regarded as binding commitments. The public position taken by Perkin Elmer is
that there will be excellent access to all the data. However, the business interests
of the firm will be constantly re-evaluated in the years ahead. Perkin Elmer is free,
as it should be, to change its position. Secondly, much of the utility of the data to
experts will depend on access not just to processed data, but also to the raw
output from the instruments. The amount of raw data will be vast and it will
require pro-active effort on Perkin Elmer's part to insure that these data are
accessible in a readily analyzed form. Since it is difficult to see why Perkin Elmer
will  have  any  incentive  to  make  the  needed  effort,  accessibility  is  likely  to  become 
bogged  down  in  haggling  with  federal  agencies  about  who  will  pay  for  and  take 
responsibility  for  the  data  handling  and  whether  or  not  the  cost  is  justified. 
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Impact  on  Current  Efforts 

Q2.       How  would  your  current  efforts  be  affected  by  the  joint  venture? 

A2. The answer to this question depends on how it goes. Right now the only effect is
that it has generated inordinate amounts of discussion for which there is not much
basis. If the effort actually results in quick delivery of a high-quality human
sequence, it would have a major effect on my activities: I could move on a few
years earlier than planned to other research goals. However, I will only
contemplate such a move once I see that the venture is really fulfilling the strong
claims that have been made for it. My expectation is that the venture will end up
having only a minor effect on my activities. Scientists are always making minor
adjustments to rapidly changing external developments. It will have more impact
on scientists who are in the thick of analyzing particular problems in human
genetics (as opposed to engaging in large-scale genome analysis). These
scientists will benefit from earlier access to valuable data than would
otherwise have been the case.

Importance  of  Genomic  Data  That  May  Be  Withheld 

Q3.  How  important  do  you  feel  the  100  to  300  sequences  that  would  be  withheld 
are  to  the  broad  assemblage  of  knowledge? 

A3.  As  long  as  all  the  data  are  released,  as  promised,  and  there  is  no  effort  to  deter 
academic  researchers  from  using  these  data  in  follow-up  studies,  I  am 
unconcerned  about  whether  Perkin  Elmer  attempts  to  patent  100  genes  or  100,000 
genes.  It  is  not  up  to  scientists  to  write  or  interpret  the  patent  law.  I  only  become 
concerned  when  intellectual-property  issues  become  an  obstacle  to  the  free  pursuit 
of  new  knowledge. 

Reasonable Fees and Conditions to Private-Controlled Genetic Information

Q4. Could you share with the committee what you feel are reasonable fees and
conditions to the genetic information Perkin-Elmer will control?

A4. I assume that this question concerns licensing fees to commercial firms who want
to use information that is protected through patents or copyrights. I have no
expertise in this area. My opinion, expressed as that of a scientist rather than an
expert in the commercial aspects of biotechnology, is that it does not serve the
public interest for pharmaceutical companies to confront a tangle of expensive
licensing issues whenever they choose to pursue a new product-development
program. Most of the real costs and real difficulties associated with drug
development lie far downstream from DNA sequencing, and the rewards of
successful drug-development efforts should be kept well aligned with the steps in
the process that involve the highest risk and require the largest investment.

Rights of Individuals — Privacy and Compensation Issues

Q5. Please discuss rights of individuals whose specific genomic sequences could
lead to a commercially successful drug. Are there privacy issues? Are there
fair compensation issues?

A5.  This  area  bears  watching.  Certainly,  there  are  privacy  issues  whenever  DNA 
sequences  go  into  databases.  I  believe  that  all  such  data  should  meet  a  high 
standard  of  anonymity,  and  we  should  also  avoid  drifting  toward,  just  as  a  matter 
of convenience, obtaining a high proportion of human sequence from the DNA of
a  small  number  of  individuals.  In  general,  the  tradition  of  obtaining  research 
samples  from  individuals  who  are  largely  motivated  by  altruism—with 
compensation  that  is  only  related  to  the  time  and  effort  that  they  must  expend  in 
providing  the  samples—serves  the  public  interest  well. 

Biomedical  research  depends  on  ready  availability  of  enormous  numbers  of 
research  samples  acquired  from  patients  and  volunteers,  under  conditions  of 
informed  consent,  every  day.  It  would  not  serve  the  public  interest  to  inject  legal 
contracts  and  commercial  agreements  into  the  relationship  between  research 
subjects  and  researchers.  We  also  do  not  want  to  turn  the  process  into  a  lottery. 
Any particular commercially important discovery can be traced to a particular
sample or small number of samples; however, in most cases, the individuals who
provided those samples are no more deserving of special rewards than the
thousands of other people who also allowed their samples to be used for similar
research purposes.

In  short,  we  should  insist  on  high  standards  of  privacy,  anonymity,  and  informed 
consent  but  should  not  start  a  system  in  which  donors  of  research  samples  have  an 
ongoing  legal  and  commercial  interest  in  the  research  projects  that  employ  their 
samples.  However,  sticky  issues  will  still  arise,  particularly  when  the  special 
commercial  potential  of  a  particular  sample  can  be  recognized  in  advance  of 
extensive  scientific  analysis  or  when  samples  are  collected  in  cultural  settings 
where the research subjects have had little exposure to modern medicine or do not
feel they benefit from advances in medical knowledge. Nonetheless, the more
closely we can stick to a system in which well-informed research subjects volunteer
to provide research samples out of altruistic motives, the better the public interest
will be served.

APPENDIX 2: Additional Materials for the Record

SCIENCE'S COMPASS


Policy: Genomics


Shotgun  Sequencing  of  the 
Human  Genome 

J. Craig Venter, Mark D. Adams, Granger G. Sutton,
Anthony R. Kerlavage, Hamilton O. Smith, Michael Hunkapiller


The Human Genome Project (HGP) was officially launched in the United States
on 1 October 1990 as a 15-year program to map and sequence the complete set of
human chromosomes and those of several model organisms. The HGP is laying the
groundwork for a revolution in medicine and biology. Its importance is
underscored by the level of funding from the National Institutes of Health, the
Department of Energy (DOE), the Wellcome Trust, and other governments and
foundations around the world.

From the inception of the HGP, major technical innovations that would affect
its timetable and cost were considered essential to success. The development of
bacterial artificial chromosomes (BACs) (1) provided a key advance. BACs are
propagated in Escherichia coli and carry large [~150-kilobase pairs (kbp)]
inserts stably. In contrast, ordered cosmid clones that served as the basis of
yeast (2) and Caenorhabditis elegans (3) genome sequencing projects are less
stable and much shorter (~35 kbp). Fluorescent labeling of DNA fragments
generated by the Sanger dideoxy chain termination method has been the mainstay
of almost all large-scale sequencing projects since the introduction of the
first semi-automated sequencer by Applied Biosystems in 1987 and the
development of Taq cycle sequencing in 1990. New models of the sequencer that
can process more samples, Taq polymerase engineered especially for sequencing,
and higher sensitivity dyes have improved throughput, accuracy, and operating
costs. Publication of the first genome from a self-replicating organism,
Haemophilus influenzae, was based on a whole-genome shotgun (random sequencing)
method (4). A set of algorithms called the TIGR Assembler (5) together with
scaffolding sequences from both ends of 18-kbp inserts in bacteriophage lambda
clones were critical for determination of correct order and assembly. Eight
additional genomes have since been completed by these methods (4, 6, 7), and
several others are nearing completion, including genomes with high GC (~65%)
and high AT (~82%) composition, which present special problems for sequencing
and assembly.

J. C. Venter, M. D. Adams, G. G. Sutton, A. R. Kerlavage, and H. O. Smith are
at The Institute for Genomic Research (TIGR), Rockville, MD 20850, USA.
M. Hunkapiller is at Perkin-Elmer Applied Biosystems, Foster City, CA
94404-1128, USA.

Current approaches to human genomic sequencing rely on building sequence-ready
maps over regions ranging in size from hundreds of kilobase pairs to whole
chromosomes and then sequencing individual BACs spanning these regions through
a combination of shotgun and directed approaches. This method can produce
highly accurate sequence with few gaps, although most sequencing centers have
encountered regions that appear to be unsequenceable by current technology.
The up-front steps of building and validating the sequence-ready map and
subclone library construction and the downstream steps of directed gap filling
are generally considered to be rate limiting. About 120 Mbp of human genomic
sequence were completed through 1997, and another 200 Mbp are planned for 1998.

[Figure: Covering the genome. A 100-kbp portion of the genome showing expected
clone coverage. Labels: BAC ends; 10-kbp clones, 100 per 100 kbp.]

The recent announcement by Perkin-Elmer of a new, fully automated sequencer
(ABI PRISM 3700) permits a reevaluation of strategies for completing the human
genome sequence. This instrument is a capillary-based sequencer that can
process ~1000 samples per day with minimal hands-on operator time (~15 min
compared with ~8 hours for the same number of samples on ABI PRISM 377s). This
reduction in operating labor, coupled with automation of sample purification
and sequencing chemistry enabled by the sequencer's improved detection
sensitivity, suggests that the tens of millions of sequencing reactions
necessary to complete the human genome can be performed more quickly and at
lower cost than previously anticipated. The Institute for Genomic Research
(TIGR) and Perkin-Elmer have started a program to complete this task within 3
years using this new technology and a whole-genome shotgun strategy that
obviates the need for a sequence-ready map before sequencing. We intend to form
a new company to carry out this venture and develop a commercial business based
on these efforts. The cost of the project is estimated to be between $200
million and $250 million, including the complete computational and laboratory
infrastructure to develop the finished sequence and informatics tools to
support access to it.

The whole-genome shotgun strategy involves randomly breaking DNA into segments
of various sizes and cloning these fragments into vectors. The presence of
repeat elements, regions that are unclonable in a particular vector, and the
benefit of having more DNA available in clones than is actually sequenced (see
figure and table) require that multiple vector libraries be used. A library of
pUC18-based plasmids containing ~2-kbp inserts will provide most of the
sequencing templates. These clones will be sequenced from both ends to produce
pairs of linked sequences representing ~500 bp at the ends of each insert. End
sequences from a library of low-copy number plasmid clones containing ~10-kbp
inserts will provide medium-range linking, including spanning the common Line-1
and THE repeat elements. Use of multiple cloning systems should help to reduce
the effect of sequences that are unclonable or otherwise not present in one of
the libraries. The goal is to generate 70 million high-quality DNA sequences
totaling ~35 billion bp (10x coverage) of raw human sequence.
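
The coverage figure quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming the 500-bp average read length and 3.5-Gbp genome size given in the note to the article's coverage table; the function name is illustrative and does not come from the article.

```python
def sequence_coverage(num_reads: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Fold coverage: total bases sequenced divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


# Figures from the article: 70 million reads, 500-bp reads, 3.5-Gbp genome.
total_bp = 70_000_000 * 500
coverage = sequence_coverage(70_000_000, 500, 3.5e9)

print(f"{total_bp / 1e9:.0f} billion bp of raw sequence, about {coverage:.0f}x coverage")
```

Under these assumptions the total is 35 billion bp, which is ten times the assumed genome size, matching the "10x" in the text.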

An argument for whole-genome shotgun sequencing of the human genome was made
(8) and rebutted (9) in 1997. A year later, we see developments in technology
and a new resource for this project consisting of a large database of end
sequences of BAC clones. This will provide a framework for linking contigs over
larger regions. Currently, the DOE is funding a program at TIGR and the
University of Washington to sequence both ends (~500 bp from each end) of
300,000 human BAC clones. This BAC-end sequencing strategy was originally
proposed to accelerate genome sequencing by providing markers every 5 kbp
throughout the genome (10).
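
The marker spacing implied by the BAC-end program can be estimated as genome size divided by the number of end sequences. This is a rough sketch only: it assumes end sequences are spread evenly, and the genome sizes used (3.5 Gbp from the article's table note, and a rounder 3.0 Gbp) are assumptions made here, not values the article ties to the 5-kbp figure.

```python
def marker_spacing_bp(num_clones: int, ends_per_clone: int, genome_size_bp: float) -> float:
    """Average distance between end-sequence markers, assuming even spacing."""
    return genome_size_bp / (num_clones * ends_per_clone)


# 300,000 BAC clones sequenced from both ends gives 600,000 markers.
spacing_at_3_5_gbp = marker_spacing_bp(300_000, 2, 3.5e9)
spacing_at_3_0_gbp = marker_spacing_bp(300_000, 2, 3.0e9)

print(f"{spacing_at_3_5_gbp / 1000:.1f} kbp at 3.5 Gbp; "
      f"{spacing_at_3_0_gbp / 1000:.1f} kbp at 3.0 Gbp")
```

Either assumption gives a spacing of roughly 5 to 6 kbp, in line with the "every 5 kbp" quoted in the text.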

The new human genome sequencing facility will be located on the TIGR campus in
Rockville, Maryland, and will consist of 230 ABI PRISM 3700 DNA sequencers with
a combined daily capacity of ~100 Mbp of raw sequence. The facility will also
have the infrastructure to produce ~100,000 template preps and ~200,000
sequencing reactions daily. This includes both custom and off-the-shelf robotic
devices for picking colonies, pipetting, and thermal cycling. Quality control
and assessment procedures will be implemented at each stage of the process.
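
The capacity figures above are mutually consistent, as a short calculation shows. This is a sketch under stated assumptions: the 500-bp read length is taken from the article's table note, the ~1000 samples per sequencer per day comes from the description of the ABI PRISM 3700 elsewhere in the article, and continuous operation at full capacity is an idealization introduced here.

```python
def daily_raw_output_bp(sequencers: int, samples_per_day: int, read_length_bp: int) -> int:
    """Raw bases produced per day if every sequencer runs at full capacity."""
    return sequencers * samples_per_day * read_length_bp


def days_to_finish(total_reads: int, reads_per_day: int) -> float:
    """Days of continuous operation needed to produce the target number of reads."""
    return total_reads / reads_per_day


daily_bp = daily_raw_output_bp(230, 1000, 500)   # upper bound from instrument count
days = days_to_finish(70_000_000, 200_000)       # 70 million reads at 200,000 per day

print(f"about {daily_bp / 1e6:.0f} Mbp per day; about {days:.0f} days of sequencing")
```

The instrument count gives an upper bound slightly above the stated ~100 Mbp per day, and at 200,000 reactions daily the 70 million reads would take roughly a year of uninterrupted operation, which leaves margin within the 3-year plan.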

Accompanying the challenge of obtaining the primary sequence data in a rapid
and cost-effective way is the major challenge of assembling raw data into
contiguous blocks (contigs) and assigning those to the correct location in the
genome. Complete contiguity of the clone map should theoretically be achieved
by about 9x coverage, so the 46x coverage (see table) allows for substantial
deviation from the statistical model. The pairs of end sequences from each
template are constrained by the assembly algorithms to be directed toward one
another in the final assembly and located at a given distance apart depending
on the insert size of the originating library. Although the BAC end sequences
will be the primary scaffold onto which the end sequences from the smaller
clones will be assembled, other available resources will be used to verify the
alignments and place contigs on individual chromosomes. The most important of
these resources is the large number of sequence tagged site (STS) markers that
constitute the physical maps that have been produced by many laboratories
during the first phase of the HGP. There currently are about 45,000 STS
sequences, including about 30,000 that are well ordered along the chromosomes
and provide a defined marker approximately every 100 kbp (11). Expressed
sequence tags (ESTs) that tag 50 to 80% of human genes (12) and full-length
cDNA sequences spanning up to 5 Mbp of genomic sequence will be used to verify
the final assemblies. There are likely to be contigs that are misassembled or
incorrectly linked together because of the presence of long, duplicated
segments of the genome. We expect to recognize and correct ambiguous or
conflicting assembly structures using a combination of manual inspection and
directed experimental effort.

The aim of this project is to produce highly accurate, ordered sequence that
spans more than 99.9% of the human genome (13). The 10x sequence coverage means
that the accuracy of the sequence will be comparable to the standard now
prevalent in the genome sequencing community of fewer than one error in 10,000
bp. It is likely that several thousand gaps will remain, although we cannot
predict with confidence how many unclonable or unsequenceable regions may be
encountered. We look forward to working with other genome centers to ensure
that the sequence meets the requirements of the scientific community for
accuracy and completeness; this will include making clones and
electropherograms available.
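
The "more than 99.9%" and "several thousand gaps" figures are of the size one would expect from a simple random-sampling model. The sketch below assumes an idealized Poisson (Lander-Waterman style) model in which reads land uniformly at random and any overlap is detectable; this model is an assumption made here for illustration, not one the authors state they used, and it ignores repeats, cloning bias, and unsequenceable regions.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Probability that a given base is hit by no read under a Poisson model."""
    return math.exp(-coverage)


def expected_gaps(num_reads: int, coverage: float) -> float:
    """Approximate number of gaps: reads whose right end is overlapped by no other read."""
    return num_reads * math.exp(-coverage)


covered = 1.0 - uncovered_fraction(10.0)
gaps = expected_gaps(70_000_000, 10.0)

print(f"covered fraction: {covered:.5f}; expected gaps: about {gaps:.0f}")
```

With 70 million reads at 10x coverage, the idealized model leaves well under 0.1% of the genome uncovered and predicts on the order of a few thousand gaps. Real data would be expected to do worse than this idealization in repeat-rich or hard-to-clone regions.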

Table: Analysis of coverage. As each clone is not completely sequenced, there
is a greater coverage of clones than sequences in the assembly. We assume a
500-bp average read length and 3.5-Gbp genome size.

                     Insert       Number of                  Coverage (x)
Vector               size (kbp)   Clones       Sequences     Sequences   Clones
High-copy plasmid    2            30,000,000   60,000,000
Low-copy plasmid     10
BAC                  150

An essential feature of the business plan is that it relies on complete public
availability of the sequence data. The four primary business areas are
high-throughput contract sequencing, gene discovery, database services, and
high-throughput polymorphism screening. A major consequence of the analysis of
data generated by this project will be the creation of a comprehensive human
genomic database. It will contain an extensive set of DNA and protein features
derived from the primary sequence. DNA features will include identified genes
and their regulators, repeats, links with genetic and physical mapping data,
synteny with other species, and polymorphisms. Because of the importance of
this information to the entire biomedical research community, key elements of
this database, including primary sequence data, will be made available without
use restrictions. In this regard, we will work closely with national DNA
repositories such as National Center for Biotechnology Information. We plan to
release contig data into the public domain at least every 3 months and the
complete human genome sequence at the end of the project. We also envision
providing at a minimum connect fee online access to these data and many of the
informatics tools to interpret them. We will also market the database system to
commercial companies engaged in pharmaceutical and biotechnology research.

Because the whole-genome shotgun approach will contain data from multiple
individuals (the exact number has not yet been determined), we will generate a
large number of precisely located single-nucleotide polymorphic (SNP) sites
spanning the genome. Using technology being developed at Perkin-Elmer, we will
generate assay systems to validate these markers and select a highly
informative set of at least 100,000 SNPs. We plan to work with commercial
partners to screen DNA samples associated with diseases or other conditions in
an effort to link them with particular genetic loci. The assay systems will
also be marketed by Perkin-Elmer to third parties for in-house research.
Although we do not plan to seek patent protection for the randomly selected
SNPs, we may seek patents on diagnostic tests based on the association of
particular SNPs with important phenotypic traits.

We also do not plan to seek patents on primary human genome sequences. However,
we expect that we and others will be able to use these primary data as a
starting point for additional biological studies that could identify and define
new pharmaceutical and diagnostic targets. Once we have fully characterized
important structures (including, for example, defining biological function), we
expect to seek patent protection as appropriate. Given both the complexity and
scope of the information contained in human genome sequence, as well as its
public availability, we would expect to focus our own biological research
efforts on 100 to 300 novel gene systems from among the thousands of potential
targets. If we are successful in these efforts, the patents would be available
for licensing to interested parties.

Although it is clear that shotgun sequencing at this scale has never been
attempted, it is our hypothesis that the desired result is achievable. While
building the human genome sequencing infrastructure we plan to attempt to
demonstrate the effectiveness of the shotgun strategy on a large and complex
genome, in collaboration with Gerald Rubin (Howard Hughes Medical
Institute/University of California Berkeley) and the Berkeley Drosophila Genome
Project (BDGP). Drosophila melanogaster represents a good system for testing
the whole-genome shotgun strategy because of the extensive physical and genetic
maps that exist, the presence of about 12% of the genome as high-quality
finished sequence with which to compare shotgun assembly results, and its
importance as a model organism. We will work fully with the BDGP to facilitate
the final closure process (which includes making clones and electropherograms
available), with the expected result being a highly accurate and contiguous set
of chromosome sequences. The Drosophila genome sequence will be deposited in
GenBank both while in progress and at completion. An international workshop is
being organized for September 1998 to develop a plan for completing the
Drosophila genome that encourages participation of all groups currently working
on this project.

It is our hope that this program is complementary to the broader scientific
efforts to define and understand the information contained in our genome. It
owes much to the efforts of the pioneers both in academia and government who
conceived and initiated the HGP with the goal of providing this information as
rapidly as possible to the international scientific community. The knowledge
gained will be key to deciphering the genetic contribution to important human
conditions and justifies expanded government investment in further
understanding of the genome. We look forward to a mutually rewarding
partnership between public and private institutions, which each have an
important role in using the marvels of molecular biology for the benefit of
all.

Palaeobiography

Paul  Copper 


Life: A Natural History of the First Four Billion
Years of Life on Earth. RICHARD FORTEY.
Knopf, New York, 1998. xiv, 347 pp. + plates.
$30 or C$42. ISBN 0-375-40119-9.


A portentous book title as bold as this —
Life — is  bound  to  raise  a  few  eyebrows.  It  is 
also  almost  certain  to  catch  the  eye  of  the 
book  browser.  In  a  drama  bolder  and  more 
sweeping than Gone with the Wind, Richard
Fortey sketches the full story of life on
Earth, the stage and the actors, over more
than four billion years. Originally published
in Britain as Life: An Unauthorized Biography
(HarperCollins, 1997), this bright brown
volume, plastered with the imprint of
Archaeopteryx (the oldest known bird), is as
encompassing as its title suggests. Fortey,
senior palaeontologist at the Natural History
Museum, London, takes us on a roller
coaster  from  the  spawning  of  the  simplest 
unicellular  organisms  during  violent  infancy 
of  the  Earth;  through  monumental  crustal 
upheavals,  voyages  of  continents,  and  mass 
extinctions;  to  an  ending  at  the  dawn  of  hu- 
man-recorded history. 

The  key  to  this  book,  a  layperson's  guide 
to  the  secrets  of  fossils  and  environments 
most  ancient,  is  the  way  the  author  has 
magically transposed and integrated his academic
biography and intellectual growth
into  the  natural  history  of  life.  I  know  of  no 
other  "autobiography" — if  the  book  can  be 
called  one — quite  like  this,  where  the 
author's life is stitched into such an immense stretch of time. Neatly and
adroitly, Fortey weaves his personal observations, his encounters with
scientists (famous and less well known), and his introductions to
controversies (century-old and contemporary) into a chronological tapestry of
life on Earth. The text literally begins with Salterella, the vessel that in
1967 carried Fortey, then a young Cambridge undergraduate, to his first field
season in Spitsbergen. Salterella is also one of the oldest shelly fossils, a
curious Early Cambrian genus named after the pioneering trilobite specialist
John W. Salter. First described in 1861 from the shores of Labrador (where I
have collected thousands of the little conical shells around some of the
earliest metazoan reefs), its affinities can only be guessed: is it a worm, a
coral, a mollusk?

The author is at the Department of Earth Sciences, Laurentian University,
Sudbury, Ontario, Canada P3E

[Figure: Ordovician "sea beetle." Guaranteed an excellent fossil record by
their calcite carapaces, trilobites are the characteristic creatures of the
Early Paleozoic. (Ceraurus pleurexanthemus, from Ontario.)]

Coincidence,  circumstance,  and  chance, 
and  their  effects  on  the  global  gene  pool 


through  time,  are  pervasive  themes  articu- 
lated throughout  the  book.  At  the  personal 
level, Fortey explores how one chooses a career
path, who happens to win the prizes
and  scholarships,  and  who  loses  out  to  dis- 
appear from  sight.  In  the  fossil  record  we 
learn  about  the  luck  of  the  gene  draw,  evo- 
lution through  the  trials  of  mass  extinctions, 
the  consequences  of  changing  climates, 
continental  drift,  and  cosmic  impacts. 

The  book  has  many  strengths.  Fortey  lyri- 
cally raises  fossils  from  the  dead,  re-creating 
vibrant,  vivid  organisms  that  absorb  light, 
breathe,  eat,  function,  and  interact  with 
their  ecosystems.  Read  his  descriptions  of 
the  Middle  Cambrian  Burgess  Shale  from 
Canada  ("on  the  dark  shales  there  was  a 
fishmonger's  slabful  of  arthropods"),  a  Car- 
boniferous rainforest  ("the  air  is  so  humid  that 
the  moisture  congeals  upon  your  shoulders"), 
and the Eocene Messel Grube from Germany
("imagine a delicate bat, Palaeochiropteryx, as
fragile  as  a  paper  kite,  with  every  bone  laid 
out  upon  a  dark  slab,  as  if  it  had  been  waiting 
its  turn  as  an  extra  in  a  Dracula  movie"). 
The  author  presents  bites  of  life's  story  se- 
quentially, from  oldest  to  newest,  as  if  to 
suggest (probably rightly so) that the past is the key to understanding the
present and the future. He moves continents about like cardboard cut-outs to
explain migration paths of continental tetrapods and plants. He lucidly spells
out the "rules of the evolutionary game" (which organisms needed to follow to
succeed, compete, and survive over millennia), and how these are displayed in
the fossil record. Fortey provides a bird's eye view of the science of
paleontology, and an insider's perspective of the "psycho-cultural"
shenanigans that often come with the paleo-priesthood: the cladist cult, the
mass extinction dichotomy of catastrophists and uniformitarians, the taxonomic
schism of splitters and lumpers, the heretic leaders, and the hermits who wait
in isolation to reach

Scientist's Plan: Map All DNA Within 3 Years

By NICHOLAS WADE

A pioneer in genetic sequencing and a private company are joining forces with
the aim of deciphering the entire DNA, or genome, of humans within three years,
far faster and cheaper than the Federal Government is planning.

If successful, the venture would outstrip and to some extent make redundant the
Government's $3 billion program to sequence the human genome by 2005.

Despite a host of new questions, the charting of the full human genome would
offer enormous medical and scientific benefits.

The principals have high credibility in the world of genome sequencing. They
are Dr. J. Craig Venter, [...] for Genomic Sciences in Rockville, Md., and
Michael W. Hunkapiller, president and technical maestro of the Applied
Biosystems division of the Perkin-Elmer Corporation of Norwalk, Conn.

The director of the Federal human genome project at the National Institutes of
Health, Dr. Francis Collins, first heard of the new company's plan on Friday,
as did the director of the N.I.H., Dr. Harold Varmus. Both said that the plan,
if successful, would enable them to reach a desired goal sooner. Dr. Collins
said he planned to integrate his program with the new company's initiative.
The Government would adjust by focusing on the many projects that are needed to
interpret the human DNA sequence, such as sequencing the genomes of mice and
other animals.

Both Dr. Varmus and Dr. Collins expressed
confidence that they could persuade Congress to
accept the need for this change in focus, noting
that the sequencing of mouse and other genomes
has always been included as a necessary part of
the human genome project.

Mr. Hunkapiller's unit is a principal manufacturer
of the machines used to sequence DNA, or determine
the order of chemical units. The venture will be
financed by Perkin-Elmer, a longtime scientific
instrument maker that has recently branched into
the genome field under the leadership of its new
chief executive, Tony L. White.

A plan to form a new company for the venture was
approved by Perkin-Elmer's board on Friday
afternoon. The project could have wide
ramifications for industry, academia and the
public because it would make possible almost
overnight many developments that had been expected
to unfold over the next decade.

One such development is individualized medicine,
the tailoring of drugs and other treatments to
patients depending on specific variations in their
DNA sequence. The wide availability of individual
DNA sequences would raise more urgently the
longstanding but unresolved issues of privacy and
control of genetic information.

The possible possession or control of the entire
human genome by a single private company could
also become an issue of public concern.

The new venture was conceived only a few months
ago. Mr. Hunkapiller believed that a new
generation of sequencing machines coming on line
would be so fast that the whole human genome could
be completed far sooner and 10 times more cheaply
than envisaged by the National Institutes of
Health.

He approached Dr. Venter, who had developed the
idea for a new sequencing strategy but lacked the
means to execute it. The two men concluded in
January that it would be possible to sequence the
three billion letters of human DNA within three
years, at a cost of $150 million to $200 million.

The $3-billion Federal program, by contrast, is
now at the halfway point of its 15-year course,
and only 3 percent of the genome has been
sequenced. The strategy has been to divide the
task and assign parts to various universities.
Although the program has had many successes in
pioneering a daunting task, serious doubts have
emerged as to whether the universities can meet
the target date of 2005.


The human genome contains all the instructions
(some 60,000 or so genes) needed to design and
operate the human organism. Deciphering the script
in which the instructions are written, the
chemical units of DNA, would yield a trove of
knowledge about human physiology and disease, as
well as the power, in principle, to correct the
errors in DNA programming that cause genetic
disease. The genome, once deciphered, is likely to
be seen as the foundation of human biology, and
hence is the object of intense scientific and
commercial interest.

The proposal to substantially complete the human
genome in three years would seem extreme hubris
coming from almost anyone but Dr. Venter. But
other experts deemed his approach technically
feasible.

"It's not impossible at all that he could
succeed," said Dr. William A. Haseltine, chief
executive of Human Genome Sciences of Rockville,
Md. "He has demonstrated a fine track record of
innovation and organization."

Dr. Haseltine's company was for several years in
uneasy partnership with Dr. Venter's institute.

If successful, the new venture seems likely to
impose adjustments on all the others involved in
genome research, and to offer new opportunities.
Congress, for instance, might ask why it should
continue to finance the human genome project
through the National Institutes of Health and the
Department of Energy if the new company is going
to finish first.

The sponsors of the new venture insist that there
will be more work for the human genome project
participants to do, not less, because obtaining
the DNA sequence is only the first step toward
understanding what the genetic instructions mean
and how they operate.

A new private venture has lofty goals but also
much credibility.

"There is a strong case for Congress to increase
funding for this work," said Mr. White of
Perkin-Elmer. "The post-genomic world will be much
more exciting."

With the new company, Perkin-Elmer would seem for
the first time to be stepping into direct
competition with the customers who buy its
sequencing machines and other genome-analysis
equipment. Mr. White, however, has no evident
ambitions to become the Bill Gates of the genome
world.

"We are anxious to talk to anyone who might feel
threatened by this to make very sure that we are
doing something compatible," Mr. White said.

Even Dr. Venter, who is known for his direct
approach, said, "We are trying to do this not with
an in-your-face kind of attitude." He added that
he intended to work closely with the National
Institutes of Health.

Dr. Venter forecast that the possession of the
human genome sequence would stimulate new
directions in medicine and biology, just as his
sequencing of the first bacterial genome has led
to a wave of other microbes being spun through
sequencing machines. He said he intended to build
a network of collaborators around the world to
work on human genetic diseases.

Dr. Venter and his new colleagues plan not just to
sequence the human genome but to construct a
"definitive" data base that will integrate medical
and other information with the basic DNA sequence.
An important component of the new data base will
be human polymorphisms, the geneticists' term for
commonly found variations in DNA. Though all
people and ethnic groups are thought to have an
overwhelmingly similar sequence of DNA letters in
their genome, there are many minor variations at
certain sites on the genome, and these variations
make each individual unique.

The new company's data base seems likely to rival
or supersede Genbank, the data bank operated by
the National Institutes of Health.

Having so much information in the control of one
company is also likely to be a matter of some
public concern.

"The question is, can the moral and legal
questions be addressed if the largest scientific
revolution of the next century is going to be done
under private auspices?" said Dr. Arthur Caplan,
an ethicist at the University of Pennsylvania with
whom Dr. Venter has discussed the new company's
goals.

The issues of genetic counseling and insurance
have been around for some time, Dr. Caplan noted,
but the new company's plans "accentuate the need
to improve statutes governing the control of
genetic information."


Perkin-Elmer intends to be sparing in laying claim
to intellectual property rights over the genome,
believing the company will create more demand for
its machines if it allows its sequences to be
widely accessible. Mr. White said his company had
a track record of liberally licensing its
inventions so as to improve the chances of their
becoming the industry standard.

Whether the new company could gain a significant
lock on the human genome in terms of patents is
not at all clear. Human Genome Sciences, for
example, has already obtained the full-length
sequence of 80 percent of human genes,
Dr. Haseltine said, and has presumably filed
patent applications. The new company may therefore
find that others have beaten it to the treasure
trove.

Even though many have now been sequenced, genes
constitute only 3 percent of the total genome.
Dr. Haseltine suggested that the long regions of
DNA in between the genes were like cosmology,
fascinating to know about but of little commercial
interest.

The new company will be 80 percent owned by
Perkin-Elmer, with Dr. Venter and others owning
the balance. Dr. Venter said he would resign as
president of the institute for Genomic Sciences,
his place being taken by Dr. Claire Fraser, his
wife.

Perkin-Elmer Jumps Into Race to Decode Genes

By Bill Richards

Staff Reporter of The Wall Street Journal

Scientific-instrument maker Perkin-Elmer Corp.
said it will join one of the nation's leading
genetic researchers in a bold venture to speed up
the decoding of human genes.

Perkin-Elmer, a Norwalk, Conn., company that
recently moved into the genetic-sequencing field,
said Saturday it signed letters of intent with
J. Craig Venter and Dr. Venter's Institute for
Genomic Research to form the project. They said
they expect state-of-the-art sequencers from
Perkin-Elmer's Applied Biosystems Division to give
Dr. Venter's new project greater genetic-sequencing
capacity than the entire current world
genetic-sequencing output.

The announcement brings a new competitor to a race
already being run by a host of companies,
including Incyte Pharmaceuticals and Human Genome
Sciences, with which Dr. Venter was affiliated.
Researchers are continually improving the speed
and accuracy of decoding techniques, and it
remains to be seen whether the new project
represents a major advance or simply an
incremental step, analysts say.

Sequencing the human genome - the sum of DNA,
which contains the inherited instructions for
development - is the process of identifying the
precise order of the genetic letters that make up
DNA. With this sequence in hand, scientists expect
to be able to more easily identify the estimated
[figure illegible] or so genes that make up the
entire genetic map. Scientists hope to pinpoint
all the genes sometime around the year 2010, but
it will still take years after that to figure out
what the genes actually do.

The stepped-up capability, the project's leaders
have told federal officials, could cut as much as
three or four years off the complete-decoding
timetable for the human genome. The National
Institutes of Health's human-genome project has
sequenced only about 3% of the three billion base
pairs of DNA that make up the human genome.

"This will help us to get to our goal a little
sooner, and that is good news," said Dr. Francis
Collins, director of the NIH's National Genetic
Research Institute, which is conducting the
human-genome project.


But Dr. Collins and NIH Director Dr. Harold Varmus
said yesterday that researchers at the dozen
genome centers now working on the federal project
still will have plenty to do. "If the complete
genome is like an instruction book, what
Dr. Venter's group will have when they are done
would be like a group of paragraphs that still
need to be tied together," said Dr. Collins.

Drs. Collins and Varmus said they only learned of
the new venture at a briefing on Friday. They said
the project's senior officials assured them that
whatever information is developed will remain in
the public domain. For example, drug companies
working on developing new genetically engineered
pharmaceuticals would be able to go to
Dr. Venter's group and license information for a
fee.

In New York Stock Exchange composite trading
Friday, before the news, Perkin-Elmer closed at
$68.50, up 43.75 cents.

Some researchers have voiced concern that the
first private company to decode the human genome
would be able to completely control future genetic
engineering, as software giant Microsoft Corp. has
been able to control the development of computer
software. "We were given assurances they don't
plan to lock it up," said Dr. Collins. The new
company said it "plans to make sequencing data
publicly available to ensure that as many
researchers as possible are examining it."

While there have been rumors in the scientific
community that a private company might step up to
the challenge of deciphering the entire human
genome, Perkin-Elmer's venture is the first to
take that step. The company said yesterday that it
has developed "a breakthrough DNA-analysis
technology" that will vastly speed up the
sequencing process. Perkin-Elmer said its new
analyzers will cost about $300,000 each and will
be ready for the commercial market early next
year.

The NIH's Dr. Varmus called the company's
technological advance "a stepping stone" to
hastening the decoding of the human genome. "They
appear to have pushed technology to the next
notch," Dr. Collins added.

Dr. Venter's participation in the new sequencing
company gives it unusual legitimacy in a field
where optimism has sometimes outstripped reality.
In the past few years, Dr. Venter and his
Rockville, Md., institute have pioneered methods
for quickly deciphering the entire genetic
sequence of bacteria. The institute recently
identified the genetic sequences for microbes that
cause Lyme disease, syphilis and stomach ulcers.

Under the agreement, Perkin-Elmer will own 80% of
the new company, to be based in Rockville.

Beyond Sequencing of Human DNA

By NICHOLAS WADE

THE sequencing of the human genome, a historic
goal in biomedical research, was snatched away
last Friday from its Government sponsor, the
National Institutes of Health, by a private
venture that says it can get the job done faster.
Now Government officials are scrambling to adjust
to the stunning turn of events, saying that the
task of interpreting the genome may begin much
sooner now, and that there is every reason for
Congress to continue to fund the project.

Having the human DNA sequence in hand much earlier
than anticipated will significantly accelerate the
pace of biomedical research. "People will sign on
to the concept that genome sequences are the
underpinning of biology," said Dr. Richard
Roberts, a Nobel prize winner who is the research
director of New England Biolabs. "I think we are
entering the most exciting era of biology. Finally
we might understand what life is and how it works.
The genome is just a start."

Adjusting to a bold new entry in the genome race.

The takeover of the human genome project is a
venture of unusual audacity. Almost equally
remarkable is that other genome experts seem to
accept with little reservation that the abductors
have a reasonable chance of making good on their
claim to substantially complete the human genome,
starting from scratch, in three years. The
National Institutes of Health had planned to
complete the sequence by the year 2005, after a
15-year program costing $3 billion.


The new venture will be financed by Perkin-Elmer,
the scientific instrument maker, at an estimated
cost of only $200 million. The idea was conceived
by Michael W. Hunkapiller, head of Perkin-Elmer's
Applied Biosystems division. "I won't say Mike is
a genius because he'd hit me up for a raise," Tony
L. White, the chief executive of Perkin-Elmer,
said last week. An aide added, "Let's just say he
is smart."

Dr. Hunkapiller is one of the co-inventors, along
with Dr. Leroy Hood of the University of
Washington, of the DNA sequencing machines that
determine the order of the chemical units in the
genetic material. His division recently developed
a new model of their standard sequencing machine,
one that is more highly automated and allows the
machines to work round the clock with very little
attendance. Dr. Hunkapiller realized the new
machines were so much more efficient than their
predecessors that a roomful of 200 or so might be
able to complete the whole human genome in just a
few years.

The human genome, with 3 billion units of DNA
altogether, is distributed over 23 chromosomes,
each of which is a single DNA molecule about 100
million units long. Dr. Hunkapiller's machines can
determine the order of units in fragments of DNA,
which are about 500 units in length. Some 60
million of these overlapping 500-unit pieces of
DNA must then be reassembled to give the sequence
of the full-length chromosomes from which they are
derived.
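
The arithmetic implied by these figures can be checked with a short calculation. The sketch below is illustrative only: the three numbers are the article's round figures, while the function names and the term "coverage" (how many times, on average, each unit of the genome would be read) are assumptions introduced here and do not come from the article.

```python
# Round figures quoted in the article.
GENOME_UNITS = 3_000_000_000   # units of DNA in the human genome
FRAGMENT_UNITS = 500           # units read in each fragment
FRAGMENT_COUNT = 60_000_000    # overlapping fragments to be reassembled


def total_units_read(fragment_count: int, fragment_units: int) -> int:
    """Total number of DNA units read across all fragments."""
    return fragment_count * fragment_units


def average_coverage(fragment_count: int, fragment_units: int,
                     genome_units: int) -> float:
    """Average number of times each unit of the genome is read."""
    return total_units_read(fragment_count, fragment_units) / genome_units


if __name__ == "__main__":
    print(total_units_read(FRAGMENT_COUNT, FRAGMENT_UNITS))
    print(average_coverage(FRAGMENT_COUNT, FRAGMENT_UNITS, GENOME_UNITS))
```

On these figures the fragments add up to 30 billion units, ten times the length of the genome, which is consistent with the article's description of the pieces as overlapping.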

The reassembly process is far from
straightforward, and Dr. Hunkapiller turned to
Dr. J. Craig Venter, a leading DNA sequencer who
heads the Institute for Genomic Research in
Rockville, Md. He invited Dr. Venter to a meeting
and told him he thought it might be possible to
sequence the whole genome. "Craig said, 'You've
got to be crazy,' " Dr. Hunkapiller said. "We
spent a few days working through the math and came
away thinking maybe it's doable. They went back
and redid the calculations and so did we."

The idea of a single organization cracking the
genome in a single procedure, known as a shotgun
experiment, is extremely bold. Under the approach
adopted by the National Institutes of Health, half
a dozen university laboratories are working on the
sequence, each tackling a different chromosome.

Dr. Francis Collins, the N.I.H. director of the
human genome project, is proud of their progress,
noting that 4 percent of the genome has already
been sequenced, whereas the initial plan called
for only 1 percent to be completed by this stage.
But some scientists in the biotechnology industry
say N.I.H.'s management of this industrial-scale
project has been flawed from the start.

"There have been serious problems of organization
and management both at the Department of Energy
and at N.I.H.," together with internal dissension
among the senior scientists involved, said
Dr. William A. Haseltine, chief of Human Genome
Sciences, a genome sequencing company in
Rockville, Md.

That issue will be moot if the sequencing of human
DNA is assumed by the new private venture.
However, it is hard to see how the new venture
could have started without the substantial
groundwork laid by N.I.H. and by the university
programs it funded, particularly the team at
Washington University at St. Louis, led by
Dr. Robert Waterston.

Recognizing the credibility of the new venture by
Dr. Venter and Perkin-Elmer, N.I.H. officials are
preparing to persuade Congress to continue funding
the genome project but to switch the focus from
getting the sequence to the enormous task of
interpreting it. Dr. Venter plans to enter his
findings in a public database.

One essential aid to understanding the human
genome is to sequence the surprisingly similar
genome of the mouse. Though all biologists
recognize the need for such a project, it may not
be immediately clear to members of Congress that
having forfeited the grand prize of human genome
sequence, they should now be equally happy with
the glory of paying for similar research on mice.

The new venture accentuates the emerging
importance of genomics as the central framework of
biology and medicine. "There is a real treasure
trove to be found in the total genome and its
evolutionary history, particularly as other
genomes, those of chimpanzees, new and old world
monkeys and mice, become sequenced," said
Dr. Haseltine. "Once that picture is put together
we'll have a very good idea of our evolutionary
history."

Scientists yesterday said they would form a new
company in Rockville that aims to unravel the
entire human genetic code by the year 2001, four
years sooner than the federal government expects
to complete a similar project.

The privately funded enterprise, which backers
said could be completed at perhaps one-tenth the
cost of the government program, raised immediate
questions about the relevance and future of the
$3 billion, 15-year federal effort. It also raised
fresh concerns about the prospect of the human
genetic code being expropriated by entrepreneurs
who plan to patent and sell access to the most
medically valuable parts.

Some biotechnology experts not involved in the new
company raved about the venture, saying it
promises to generate enormous amounts of genetic
data that may quickly be translated into better
diagnostic tests and treatments for diseases.

But other experts expressed skepticism that the
company could achieve its ambitious goals, saying
the new technology remains unproven and the novel
analytical approach to be used may generate less
useful information than other methods.

Federal officials said the accelerating government
effort to find and decode all 60,000 or more genes
in the human body would remain on its current
course for the next 12 to 18 months, by which time
it will be clearer whether the project should
change its approach to accommodate the new players
in the field.

"It would be vastly premature to go out and ...
change the plan of our genome centers," said
Francis Collins, head of the National Human Genome
Research Institute, the branch of the National
Institutes of Health that codirects the federal
effort with the Department of Energy.

The new company, not yet named, will be led by
J. Craig Venter, a pioneer in finding fast, cheap
ways to decode genetic information. It will be
backed by Perkin-Elmer Corp. of Norwalk, Conn., a
major supplier of equipment for genetic analysis,
and will depend on machines developed by
Perkin-Elmer.

The new company will lease space near Shady Grove
Adventist Hospital just off Interstate 270 in
Montgomery County's booming biotechnology
corridor, Venter said. The new venture, which
expects to go into operation early in 1999, will
be 80 percent owned by Perkin-Elmer.

The company will employ between 400 and 800 people
to run 230 specialized new machines (each about
the size of a minibar) that will operate 24 hours
a day decoding information from human genes that
have been isolated from sperm and other cells,
Venter said. The electric bill alone is expected
to hit $5,000 a day.

Venter helped found Human Genome Sciences Inc. of
Rockville, the first private company in the nation
to amass large amounts of genetic data, and now
heads the nonprofit Institute for Genomic
Research, also in Rockville.

Several biotechnology companies, including Human
Genome Sciences, are in the business of decoding
genetic information and selling it to
pharmaceutical companies and others who hope to
profit. Most of these biotech companies claim to
have decoded more than 80 percent of human genes
already, although the functions of most remain a
mystery.

These companies have been granted scores of
patents on their genetic discoveries, raising
fears among some critics that a handful of
companies will control the commercialization of a
vast and potentially lucrative biological
resource. Those fears arose again yesterday with
Venter's announcement of his new project.

"Even though they are promising public access,
they control the terms and there is a history of
terms being more onerous than is acceptable to
most scientists," said Maynard Olson, a medical
geneticist at the University of Washington.

Venter said that with the exception of perhaps 100
to 300 genetic sequences that he expects will show
special commercial promise, the company will make
all the genetic information available free to the
world's scientists. "It would be morally wrong to
hold the data hostage and keep it secret," he
said.

Perkin-Elmer senior vice president Michael W.
Hunkapiller said the company will make money by
analyzing the genetic information and then selling
the results to pharmaceutical companies. The
company also plans to analyze the tiny genetic
differences between individuals, as opposed to
getting a "generic" genetic sequence for the
average human being. That new level of
information, also being sought by federal
laboratories, may help drug companies customize
medicines for individuals or small groups of
people.

Venter's technique will differ markedly from that
being used by biotech companies. Those companies
use a shortcut that deliberately omits large
amounts of information whose role in the body is
unclear.

By contrast, Venter's project aims to unravel
every bit of genetic information, regardless of
whether it's suspected to be useful, and to
organize the resulting database into a massive and
readily consulted blueprint of human life.

To do so, the Perkin-Elmer machines will use a
controversial approach called "shotgun whole
genome sequencing." Instead of focusing on large
pieces of DNA, this process decodes tiny pieces
that later must be assembled like interlocking
pieces of a jigsaw puzzle. Because of the added
difficulty of dealing with so many small pieces,
the resulting picture of the human genome is
likely to be peppered with more and larger holes
than that produced by the federal program, Collins
said.
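
The jigsaw-style reassembly described above can be illustrated with a toy example. The sketch below is a deliberately simplified greedy merge written for illustration; it is not the method Venter's group or Perkin-Elmer planned to use, and the function names, the `min_len` threshold and the sample fragments are all assumptions introduced here. It repeatedly joins the two fragments with the longest overlap and stops when no overlaps remain, so any fragments left unjoined correspond to the "holes" Collins mentions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge fragments greedily by longest overlap; return the pieces that remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining pieces are separated by gaps
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    # Hypothetical overlapping reads taken from one short made-up sequence.
    print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]))
```

A greedy merge of this kind can be misled by repeated stretches of sequence, which is one reason a real whole-genome assembly is far harder than this example suggests.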

The government considered switching to the
approach that Venter will use a few years ago,
Collins said, and "roundly rejected" it as too
problematic. But Venter and others said recent
technical improvements make the approach superior.

Executives of biotechnology companies involved in
genetic research have long argued that they could
do the work of the federal genome project faster
and more cheaply. William Haseltine, head of Human
Genome Sciences, yesterday called the government's
program a "gravy train" and faulted its leaders
for what he described as a failure to enlist
private industry.

While expressing some doubt that Venter and
Perkin-Elmer would find ways to make money on
their new endeavor, he said he had little doubt
they would succeed in decoding the entire human
genome in three years.

"This has to feel like a bomb dropped on the head
of the Human Genome Project," Haseltine said by
telephone from Frankfurt. "All of a sudden
somebody is going to pull a $3 billion rug out
from under you? They must be deeply shocked."

Genetic mapping triggers contest

Academics race private enterprise

By Clive Cookson
FINANCIAL TIMES


LONDON - The race between academic and commercial
interests to unravel the entire human genetic code
took another twist Wednesday when the
British-based Wellcome Trust, the world's largest
charity, announced that it would spend an extra
$184 million on the project over the next seven
years.

The trust's commitment, on behalf of the public
sector, is a challenge to the commercial genomics
venture announced in the United States last
weekend.

Perkin-Elmer,  the  scientific 
instrumentation  company,  said  it 
would  set  up  a  new  company  with 
Craig  Venter,  president  of  the  Insti- 
tute for  Genomic  Research,  "to  sub- 
stantially complete  the  sequencing 
of  the  human  genome  [all  human 
DNA]  within  three  years." 

Wellcome said in a statement Wednesday: "The Trust
is concerned that commercial entities might file
opportunistic patents on DNA sequences."

The trust is conducting an urgent review of the
credibility and scope of gene patents. In a clear
threat to Perkin-Elmer and other commercial
organizations, Wellcome said it "is prepared to
challenge such patents."

The Human Genome Project, a $3 billion, 15-year
effort to spell out all 3 billion chemical
"letters" in human DNA, was started in 1990 in the
public sector, with funding mainly from the U.S.
government. But during the 1990s the private
sector moved in, led by Human Genome Sciences, a
U.S. biotechnology company.

Now  there's  intense  competition 
—  not  only  between  gene-hunting 
companies  but  also  between  the  pri- 
vate and  academic  sectors  as  a 
whole. 

The private sector says the profit motive is
accelerating the medical application of genetic
information, while the academics, led by the
Wellcome Trust, claim that companies are delaying
progress by preventing the open release of
information.

The  trust's  new  commitment  will 
bring  its  total  spending  on  the 
Human  Genome  Project  to  $328 
million.  The  work  is  based  at  Well- 
come's  new  Genome  Campus  in 
Cambridge,  England,  where  DNA 
sequences  are  released  freely  on 
the  Internet  as  they  are  produced. 
In  the  United  States,  Venter  plans 
to  use  ultrafast  DNA  sequencing 
machines  developed  by  Perkin- 
Elmer,  together  with  a  new  scientific 
strategy,  to  move  ahead  faster  than 
the  public-sector  genome  project. 
The new company is expected to
have  a  research  budget  of  about 
$200  million. 

Although  the  data  will  be  made 
publicly  available  after  a  delay,  the 
company  plans  to  build  up  a  com- 
mercial database  and  to  patent  some 
genes. 

Michael Morgan, who runs Wellcome's genomics program, said Venter's
shotgun approach remained speculative and had not been proved to
work. "At best it will give a quick and dirty version of the genome,"
he said.
Distributed by Scripps Howard
International Gene Project Gets Lift

Wellcome  Trust  Doubles  Commitment  to  Public-Sector  Effort 




By NICHOLAS WADE

The politics of the human genome project, the plan to sequence or
analyze the entire DNA of human cells, has become suddenly more
complicated, on both a personal and international level.

The project, a glittering scientific prize expected to form the
underpinning of biology and medicine in the next century, is a $3
billion Federal effort, bolstered with a significant British
contribution, that aims to decode the three billion chemical letters
of human DNA by 2005.

This program, now halfway through its 15-year course, was upstaged by
the announcement on May 10 that a private company would start and aim
to complete the human DNA sequence in three years at a fraction of
the cost.

Now  the  Wellcome  Trust  of  Lon- 
don, the  world's  largest  medical  phi- 
lanthropy, has  stepped  into  the  fray 
in  an  effort  to  maintain  the  impetus 
of  the  publicly  financed  program  and 
to  prevent  the  human  genome  se- 
quence from  falling  under  the  control 
of  a  private  company. 

The  trust  said  this  week  that  it 
would  double  the  money  it  gives  to 
the  Sanger  Centre  near  Cambridge, 
England,  enabling  biologists  there  to 
sequence  one-third  of  the  genome,  up 
from  their  previous  goal  of  one-sixth. 
In  addition,  the  trust  said  it  stood 
ready to pay for half of the entire
human  genome,  or  DNA  sequence. 

"To  leave  this  to  a  private  compa- 
ny, which  has  to  make  money,  seems 
to me completely and utterly stupid," said Dr. Michael J. Morgan,
program  director  for  the  Wellcome 
Trust. 

Asked if the trust was prepared to finance the sequencing of the
entire human genome, Dr. Morgan said, "If we had to and if we wanted
to, we could do it." The Wellcome Trust, he noted, has assets of $19
billion.

The Wellcome Trust's firm support of the existing program seems to
have had a bracing effect on its American partner, the National
Institutes of Health. Officials there were talking last week of how
to "integrate" their program with the commercial venture, as if there
were no point in the Government continuing its sequencing efforts,
and of switching their program from sequencing to understanding how
the genome works. But as the rival commercial venture has come under
criticism from academic scientists, the officials no longer assume it
is a probable fait accompli. The new company will produce only a
"rough draft" of the DNA sequence, which may not meet scientific
needs, Dr. Harold Varmus, director of the N.I.H., wrote in a recent
letter to The New York Times.

Dr. John E. Sulston, director of the Sanger Centre, criticized Dr. J.
Craig Venter, the head of the new venture, for opting out of the
international collaboration among academic centers, and for his plan
to leave gaps in parts of the sequence. "I really don't see this as
being any great advance whatever," he said. "We are going to provide
the complete archival product and not an intermediate, transitory
version of it."

The Sanger Centre has sequenced a third of the human DNA now in the
data banks, a larger contribution than that of any other institution.

Politics swirls about a glittering scientific prize.

The fighting words from the N.I.H. and the Wellcome Trust suggest
that these two agencies are not about to fold their hands and will
continue to sequence the human genome in competition with the new
company. This venture, which has yet to be named, is being financed
by the scientific instrument maker Perkin-Elmer, under the direction
of Dr. Venter, a leading DNA sequencer and president of the Institute
for Genomic Research in Rockville, Md.

Congress will presumably face the decision of whether to continue
paying for N.I.H. to sequence the genome, possibly both lagging and
duplicating Dr. Venter's effort, or to have the N.I.H. switch the
emphasis of its program to interpreting the genome. Sequencing the
genomes of much-studied laboratory animals like the mouse and the
Drosophila fruitfly would be a major part of an interpretive,
post-genomic program but doubtless less glamorous, in Congress's
eyes, than obtaining the human genome sequence.

Dr. Venter, a scientist who prizes his independence and has seldom
been averse to criticizing the scientific establishment, says his
critics are reacting from emotion and an incomplete understanding of
what he proposes to do. Despite the commercial basis of his new
venture, he says he will attain the same accuracy (no more than one
error in 10,000 units of DNA) as the academic centers. On the issue
of completeness, Dr. Venter acknowledges he will leave certain gaps
in the genome sequence but he and his critics differ on the
significance. Dr. Robert Waterston, a leading DNA sequencer at
Washington University in St. Louis, said the quality of Dr. Venter's
sequence will be "very significantly compromised," with the final
product being similar to "an encyclopedia ripped to shreds and
scattered on the floor."

Dr. Venter said he planned to leave no gaps in the genes themselves
or in any important region between the genes. "These arguments and
debate are over less than 100th of 1 percent of the genome," he said.
Dr. Venter knows that if his project succeeds, he will force a major
adjustment on his academic competitors. He alternates between
offering balm and salt for his rivals' wounds. He says he seeks to
cooperate with other centers and will share his raw data, the
chromatographic traces from the DNA sequencing machines, on request.
But he also says he plans to sequence the genome of the Drosophila
fruitfly, an important laboratory organism, as a trial run for the
human sequence, and adds, "We are going to do the Drosophila genome
in one-tenth the time of the C. elegans sequence and more
accurately."

This is a jibe at Dr. Sulston and Dr. Waterston, who expect to
complete the DNA sequence of the C. elegans nematode worm, another
important laboratory organism, by the end of this year. This
spectacular achievement will mark the first animal genome to be
sequenced.

Dr.  Sulston  and  Dr.  Waterston 
have  collaborated  for  many  years  in 
a  friendship  that  began  in  Cam- 
bridge. They  chose  the  worm  ge- 
nome as  the  pilot  project  for  their 
assault on the human genome.

They  and  Dr.  Venter  are  well 
known  as  pioneers  in  the  field  of 
genomics,  the  study  of  an  organism's 
full  set  of  genes.  Dr.  Sulston  and  Dr. 
Waterston  have  been  influential  in 
setting  the  technical  standards  of  the 
human  genome  project  and  the  ethi- 
cal standards  for  making  data  im- 
mediately available  to  other  re- 
searchers. Dr.  Venter  has  pioneered 
the  sequencing  of  bacterial  genomes, 
a  flourishing  new  field  that  is  likely 
to  have  a  broad  impact  on  medicine. 
Gene-Mapping, Without Tax Money


By  William  A.  Haseltine 




ROCKVILLE, Md.

Sometimes, it's smart not to compete. The Energy Department and the
National Institutes of Health are spending $3 billion to decode the
entire human genetic structure by 2005. But this effort has recently
been upstaged by a new private company founded by Dr. J. Craig
Venter, president of the nonprofit Institute for Genomic Research,
and the Perkin-Elmer Corporation. This venture, which will spend
about $200 million, promises to complete the job in a mere three
years. In response, the Wellcome Trust, a British foundation, pledged
to double its $185 million grant to a nonprofit laboratory for
similar work.

Decoding the entire genome would surely be a glittering scientific
achievement and may lead to some scientific breakthroughs. And
knowing how individual genes work and how they fail is the key to
discovering new ways to predict, detect, treat and cure many, if not
most, diseases.

But  there  is  a  good  reason  that  the 
Federal  Government  should  end  its 
effort:  decoding  the  entire  genome 
doesn't  add  significantly  to  the  infor- 
mation we  already  possess. 

Imagine that the genome is an encyclopedia with about three billion
letters. Buried within this text are about 100,000 sentences (the
genes) that tell the body what essential proteins to make.

The sentences are separated from one another by page after page of
random letters, what scientists call junk DNA. To make matters even
more complicated, the sentences themselves are also fragmented and
interrupted by pages and pages of random letters (more junk DNA). In
fact, less than [illegible] percent of our DNA contains real
information. The other [illegible] percent has no genetic meaning.

How do we know this is really true? We've already decoded 3 percent
of the entire genome. And this is the picture we get.

Each of the human genome projects, however, seeks to read the entire
text from beginning to end, regardless of whether the information is
useful. And regardless of the fact that we've already decoded the
useful DNA.

About  eight  years  ago,  a  new 
means  to  discover  genes  using  com- 
puterized robots  was  developed.  This 
method  takes  advantage  of  the  fact 
that  the  human  body  is  an  excellent 
editor, that it can splice together the
gene  fragments  to  form  a  coherent 
text. 

Instead of searching for relevant gene fragments within junk DNA, the
new robotic method ignores the junk DNA and isolates only the body's
edited text. This new method has been used to discover about 100,000
useful genes, almost a complete set. (My company has filed patents on
more than 500 of these genes.) This information is now available for
medical research: much of it is even on the World Wide Web.

[Several words illegible] the Federal Government to go to the trouble
of decoding the junk DNA. Today's task is to discover the medical
uses of each gene and to find gene-based cures for cancer, heart
disease, Alzheimer's, osteoporosis and other diseases. The $3 billion
of Federal money now devoted to the entire human genome should be
spent instead on university-based research, initiated by individual
medical investigators.

The era of government-sponsored big science, in which a few
laboratories receive as much as $10 million a year to analyze mostly
junk DNA, while scientists doing disease-related research beg for
funding, should end.

Let private companies and charitable foundations finish the job of
sequencing the human genome. National pride should come from conquest
of disease, not winning a race that is not worth winning.

William A. Haseltine, a professor at Harvard Medical School from 1976
to 1993, is chief executive officer of Human Genome Sciences, which
does gene research. From 1992 to 1996, his company helped finance the
Institute for Genomic Research.
THE DUO JOLTING
THE  GENE  BUSINESS 

Craig Venter and Perkin-Elmer target the human genome


In late 1997, an ambitious idea occurred to technology guru Michael
W. Hunkapiller of Perkin-Elmer Corp. Hunkapiller's team was
developing a robotic machine that promised to decipher human genes
far faster and more cheaply than any previous system. Why not use the
new device, Hunkapiller wondered, to tackle one of the biggest prizes
in all of biology: successfully deciphering the entire human genetic
code? He brought his idea to gene sleuth extraordinaire J. Craig
Venter, president of the nonprofit Institute for Genomic Research in
Rockville, Md. The result, announced on May 9, is a still unnamed
company that will decipher what one "might describe as the full
Monty, the entire genome," says Venter. With some 230 of the new
$300,000 Perkin-Elmer machines running around the clock, Venter and
colleague Mark Adams figure they can break the 3 billion individual
units of human DNA (the genome) into pieces and decode a staggering
100 million individual units a day. They plan to finish the genetic
code in three years, at a total cost of about $200 million, with
Perkin-Elmer picking up the tab. That is a fraction of what the
federal government is spending to complete the task, and Venter vows
to finish four years sooner.

FORMIDABLE
Venter plans to finish the genetic code in three years, with
Perkin-Elmer picking up the tab

What's more, Venter and Perkin-Elmer will give away the entire human
DNA sequence, just as the government plans to do. "We agreed it would
be morally wrong to hold the data hostage," says Venter. The gamble
for Perkin-Elmer, a pioneer in gene sequencing, is that it can make
money by selling information about what the sequence means, as well
as finding new genes for developing medical therapies.
BLACK EYE. The announcement sent shock waves through the red-hot
field of gene-mining. This discipline, called genomics, is already
populated by dozens of companies (table, page 72) and academic labs
seeking to understand and profit from DNA's secrets. Companies such
as Human Genome Sciences Inc. (HGS) and Incyte Pharmaceuticals Inc.
have already made millions selling access to their private stashes of
gene sequences. But the new company is a formidable competitor, "a
1,000-pound gorilla," says analyst Elizabeth Silverman of BancAmerica
Robertson Stevens. Adds Randal W. Scott, president of Incyte in Palo
Alto, Calif.: "This puts a new competitor into play." And the idea
that a private company can soundly beat the existing taxpayer-funded
effort to the prize "is a tremendous black eye for the government,"
says William A. Haseltine, CEO of HGS. "They will lose the race to
the genome."

But  the  venture  also  raises  a  host  of 
questions.  Does  the  massive  private  ef- 
fort mean  that  the  government's  Hu- 
man Genome  Project  should  redirect  its 
efforts?  And  will  Perkin-Elmer  actually 
be  able  to  make  money  from  its  radi- 
cally different  business  plan? 

On the science, few are betting against Venter. "There's no question
that the person who can put together an operation like this and make
more headway than anyone else is Craig Venter," says Stanford
University biochemist and Nobel laureate Paul Berg. Back in the
mid-1990s, Venter pioneered a "shotgun" approach to deciphering
entire genomes. The idea was to chop the DNA of an organism into
pieces, decipher each of them, and then use computers to compare and
assemble them in the right order. Using the technique, Venter
astounded the scientific world by decoding the first complete genetic
sequence of a living organism, a bacterium called Haemophilus
influenzae.
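
The "compare and assemble" step of the shotgun idea can be illustrated with a toy sketch. This is not Venter's software or any production assembler: it is a minimal greedy overlap merge written for illustration, and the function names, the `min_len` threshold, and the short example sequence are all assumptions. It also assumes error-free fragments and a sequence without repeats, which real genomes do not offer.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of fragments until no overlaps remain.

    Returns the remaining contigs; more than one contig means gaps are left.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical 25-letter "genome" chopped into four overlapping pieces.
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
pieces = [genome[12:22], genome[0:10], genome[18:25], genome[6:16]]
contigs = greedy_assemble(pieces)
print(contigs)
```

When fragments do not overlap, the function returns several contigs rather than one, which is the toy analogue of the gaps discussed elsewhere in these articles.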

Perkin-Elmer's new machines will speed up the process. Its Applied
Biosystems Division sold $650 million worth of DNA sequencers and
related instruments and services in fiscal 1997. The new tool,
available next year, "is an evolution of our current system," says
Hunkapiller. Its improved sensitivity and automation will
dramatically boost productivity.

DATA FLOOD. Venter is hinting that the government's genome [...] of
animals instead of people. That's not likely. Dr. Francis Collins,
head of the National Institutes of Health's genome center [...] even
if Venter succeeds, making sense of the flood of information won't be
easy. Only about 3% of human genetic material is actual genes. Some
of the remaining 97% of the DNA turns genes on and off, and
scientists think that much of the rest is meaningless junk. Part of
Venter's job will be to figure out what's what, and that could be
tough. "The genes jump right out at you in microbial sequences," says
Richard K. Wilson of Washington University's gene-sequencing center.
"In humans, it's much more difficult."

Many are confident of Venter's scientific claims, but the business
end of this venture is another story. Perkin-Elmer faces an uphill
battle convincing the biotech world that this is a money-making idea.
"What they're describing is not a commercial venture," says Incyte's
Scott. "It's really Craig Venter going after the Nobel prize for
sequencing the genome." HGS's Haseltine is also skeptical. "The human
genome project has never been a commercial venture," he says. "This
is more in the tradition of the Mellons and Carnegies," funding a
project that promises mainly to push back the bounds of knowledge.

Perkin-Elmer execs insist that their proposal has been misunderstood.
"People still don't see how, if we give away the data, we will make
money," sighs CEO Tony L. White, as he patiently explains the plan.
Stanford's Berg says that "the big game is how to make use of the
information," and that's the information White plans to sell. Rival
Incyte is already an old hand at this. In fact, one of its products
is a repackaging of publicly available data in more usable form, says
analyst Mike G. King of Vector Securities International. Haseltine
wonders how Perkin-Elmer can do this "better than the rest of the
world combined."

Venter and Perkin-Elmer execs retort that the new company will have
enough experience and smarts to be a leader in this toughly
competitive field. They envision signing up hundreds of thousands of
subscribers, both companies and academics, for a database that offers
such vital information as which sequences are genes, what the genes
do, and how genes can vary from person to person. Such variations,
called "polymorphisms," determine whether individuals are susceptible
to certain diseases or how well drugs will work. Doctors and
pharmaceutical companies can use the information to better diagnose
and treat people based on their genetic variations. And companies
such as Affymetrix Inc. will benefit, analysts predict. Affymetrix
makes gene chips, which can almost instantly spot the presence of
thousands of different genes or gene variations.

DRUG DEVELOPMENT. Perkin-Elmer should also benefit. The $1.4 billion
company has moved aggressively to acquire companies and new
technology, transforming Perkin-Elmer from an instrument maker to one
that provides services and information as well. Since White took over
in 1995, the company has acquired Tropix, a leader in screening drug
candidates, and GenScope, developer of gene expression technology,
and forged partnerships with other players. For instance, it teamed
up last June with gene-chip developer Hyseq Inc., whose products can
be used to search for gene variations.

Venter's and Perkin-Elmer's venture may also profit from new genes
that Venter finds. The main current approach for finding genes
involves fishing out those that are actually turned on in [...] gold.
That's because some genes may turn on too rarely to be discovered. He
estimates that by sequencing the entire genome, he'll find 10,000 to
20,000 new genes. Many will be genes for vital signaling pathways in
the body and brain, ideal candidates or targets for drugs. As a
result, "these genes will have tremendous value on their own," he
says. He expects the new company to pluck out a few hundred of the
most promising to patent and use for drug development.

The risks, of course, are high. Haseltine and others think the new
company may very well succeed at deciphering the entire human genome.
Making money, however, will be harder. Venter knows that, but thinks
he'll prove the skeptics wrong within a year. By then, he and his
supporters believe, the new tools will prove their worth, and
vindicate Venter's hunches once again.

By John Carey in Washington

WHO'S WHO IN GENES

Craig Venter's new venture is entering a crowded field. Here are
some key players that want to unlock the secrets of genes:

MERCK  Funded a [illegible] gene-hunting project at Washington
University [...]

AXYS  Finding genes for diseases such as asthma, then searching for
drugs to tackle the diseases.
Policy: Biomedicine


An  Independent  Perspective  on 
the  Human  Genome  Project 


Steven  E.  Koonin 


The U.S. Human Genome Project (HGP) is a joint effort of the
Department of Energy and the National Institutes of Health,
formally  initiated  in  1990.  Its  stated  goal  is 
".  .  .  to  characterize  all  the  human  genetic 
material — the  genome — by  improving  ex- 
isting human  genetic  maps,  constructing 
physical  maps  of  entire  chromosomes,  and 
ultimately  determining  the  complete  se- 
quence ...  to  discover  all  of  the  more  than 
50,000  human  genes  and  render  them  ac- 
cessible for  further  biological  study."  The 
original 5-year plan was updated and modified in 1993 (1, 2).

DOE's Office of Biological and Environmental Sciences recently
chartered the JASON group to review the DOE component of the HGP.
This group, mainly consisting of physical and information scientists,
was asked to consider three areas: technology, quality assurance and
quality control, and informatics. This article summarizes the group's
findings and recommendations (3).

Technology. The present state of the art for determining the sequence
of DNA is defined by Sanger sequencing, in which DNA fragments are
labeled by fluorescent dyes and separated according to length with
polyacrylamide gel electrophoresis (PAGE) (4). The base at the end of
each fragment can then be visualized and identified by the dye with
which it reacts. Although more than 95% of the genome remains to be
sequenced, roughly 55 megabases (Mb) have been completed in the past
year (see the figure). The world's large-scale sequencing capacity
(not all of which is applied to the human genome) is estimated to be
roughly 100 Mb per year. It is sobering to contemplate that an
average production of 400 Mb will be required each year to complete
the human sequence by the target date of 2005.
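
The 400 Mb figure can be checked with a back-of-envelope calculation. The sketch below is an illustration, not part of the JASON review: it assumes a genome of about 3,000 Mb (the "3 billion letters" quoted elsewhere in this record), takes the 95% remaining and the 100 Mb and 55 Mb per year figures from the paragraph above, and assumes seven years remain before 2005.

```python
GENOME_MB = 3000.0          # assumed genome size: about 3 billion bases
FRACTION_REMAINING = 0.95   # "more than 95% of the genome remains"
YEARS_REMAINING = 7         # assumption: roughly 1998 through 2005
WORLD_CAPACITY_MB = 100.0   # estimated world capacity per year
HUMAN_LAST_YEAR_MB = 55.0   # human sequence completed in the past year


def required_rate_mb_per_year(genome_mb: float, remaining: float, years: int) -> float:
    """Average yearly output needed to finish the remaining sequence on time."""
    return genome_mb * remaining / years


rate = required_rate_mb_per_year(GENOME_MB, FRACTION_REMAINING, YEARS_REMAINING)
print(f"required: {rate:.0f} Mb/year")
print(f"multiple of world capacity: {rate / WORLD_CAPACITY_MB:.1f}x")
print(f"multiple of last year's human output: {rate / HUMAN_LAST_YEAR_MB:.1f}x")
```

Under these assumptions the required average is a little over 400 Mb per year, about four times the estimated world capacity, which is consistent with the article's figure.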
The present technology has only a limited read-length capability (the
number of contiguous bases that can be identified from each
fragment); the best current practice can read 700 to 800 bases, with
perhaps 1000 bases as the ultimate limit. Because the DNA segments of
interest are much longer than this [40 kilobases (kb) for a cosmid
clone; 100 kb or more for a bacterial artificial chromosome or a
gene], the present technology requires that long lengths of DNA be
cut into overlapping short segments (~1 kb in length) that can be
sequenced directly. The sequences from these shorter pieces must then
be assembled into the final sequence. Up to 50% of the effort at some
sequence centers goes into this final assembly and finishing of the
sequence. The ability to read longer fragments would step up the pace
and quality of sequencing.

Percentage of the human genome sequenced to date. Almost 3% of the
genome has been sequenced in contiguous stretches longer than 10 kb
and is now deposited in publicly accessible databases. Compiled by
J. Roach, as described in
http://weber.u.washington.edu/~roach/human_genome_progress2.html.

Apart from the various genome projects, however, there is little
pressure to achieve longer read lengths. The 500 to 700 base lengths
read by the current technology are well suited to many scientific
needs, including pharmaceutical searches, studies of some
polymorphisms, and studies of some genetic diseases.

Other drawbacks of the present technology include the time- and
labor-intensive nature of gel preparation and running, as well as the
comparatively large amounts of sample required, which also increases
the cost of reagents and necessitates extra amplification steps.

Thus, the present sequencing technology leaves much to be desired and
must be supplanted in the long term if the potential for genomic
science is to be fully realized. Promising methods that could be
cheaper and faster than PAGE include single-molecule sequencing, mass
spectrometric methods, hybridization arrays, and microfluidic
capabilities. None of these is sufficiently mature, however, to be a
candidate for near-term major scale-up. It is therefore important to
support research aimed at improving the present method. Advances in
hardware development could, for example, increase the lateral scan
resolution of the machine so that more lanes of a gel can be
analyzed. The genome community should unify its efforts to enhance
the performance of present-day instruments.

Better  software  will  improve  the  lane 
tracking,  base  identification,  assembly,  and 
finishing  processes.  Many  of  the  problems  of 
base  identification  also  occur  in  the  de- 
modulation of  signals  in  com- 
munication and  magnetic  re- 
cording systems,  and  some  of  the 
existing  literature  in  these  areas 
should  be  used  by  the  HGP.  The 
ability  to  correctly  assemble  a  fi- 
nal sequence  without  manual 
editing  would  markedly  speed 
up  the  process.  It  would  also  be 
helpful  to  develop  a  common  set 
of  finishing  rules. 

Because sequencing technology should (and is likely to) evolve
rapidly, the large-scale sequencing centers must be flexible enough
to incorporate new technologies. There is a great need to support the
development of non-PAGE-based sequencing that goes beyond the current
goals of a faster version of PAGE. The funding for such advanced
technology is a small fraction of the total HGP but should be
increased by approximately 50%.

Quality assurance and quality control. DOE and NIH are recognizing
that the HGP must make data accuracy and data quality integral to its
execution. A high-quality database can provide useful, densely spaced
markers across the genome and enable large-scale statistical studies.
A quantitative understanding of data quality across the whole genome
sequence is thus almost as important as the sequence itself. Among
the top-level steps that should be taken are allocating resources
specifically for quality issues and establishing a separate research
program for quality assurance and control (perhaps a group at each
sequencing center).
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The  stated  accuracy  goal  of  the  HGP  is 
one  error  in  10*  bases,  which  is  set  to  be  less 
than  the  polymorphism  rate.  However,  this 
has  been  a  controversial  issue,  as  genomic 
data  of  lower  accuracy  are  still  of  great  util- 
ity. For  example,  pharmaceutical  companies 
searching  for  genes  can  use  short  sequences 
(400  bases)  at  an  accuracy  of  one  error  per 
too  bases.  The  debate  on  error  rates  should 
focus  on  the  level  of  accuracy  needed  for 
each  specific  scientific  objective  or  use  of 
the  genome  data.  The  necessity  of  finishing 
sequences  without  gaps  should  be  subject  to 
the  same  considerations. 
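
As a rough illustration of why the appropriate error rate depends on the use, the sketch below compares a 400-base read at one error per 100 bases with the same read at one error per 10,000 bases. It assumes that errors strike each base independently and with equal probability, which real sequencing data do not strictly obey; the function names are illustrative only.

```python
def expected_errors(length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return length * error_rate


def prob_error_free(length: int, error_rate: float) -> float:
    """Probability that a read contains no errors at all,
    assuming independent, identically distributed per-base errors."""
    return (1.0 - error_rate) ** length


if __name__ == "__main__":
    read_length = 400  # short gene-hunting read, as in the text
    for label, rate in [("1 error per 100 bases", 1e-2),
                        ("1 error per 10,000 bases", 1e-4)]:
        print(label,
              "expected errors:", round(expected_errors(read_length, rate), 3),
              "P(no errors):", round(prob_error_free(read_length, rate), 4))
```

Under these assumptions the lower-accuracy read carries a handful of errors yet remains usable for recognizing a gene, whereas a use such as detecting polymorphisms would need the stricter rate.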

In the real world, accuracy requirements
must  be  balanced  against  what  users  need, 
the  cost,  and  the  capability  of  the  sequenc- 
ing technology  to  deliver  a  given  level  of 
accuracy.  Establishing  this  balance  requires 
an  open  dialogue  among  the  sequence  pro- 
ducers, sequence  users,  and  the  funding 
agencies,  informed  by  quantitative  analyses 
and  experience. 

Assays  should  be  developed  that  can  accu- 
rately and  efficiently  measure  sequence  qual- 
ity. For  example,  it  would  be  appropriate  to 
develop,  distribute,  and  use  "gold  standard" 
DNA  samples  that  could  be  used  routinely  by 
the whole sequencing community for assessing
the  quality  of  the  sequence  output. 

Research  into  the  origin  and  propagation 
of errors through the entire sequencing process
is fully warranted. We see two useful
outputs  from  such  studies:  (i)  more  reliable 
descriptions  of  expected  error  rates  in  final 
sequence  data,  as  a  companion  to  database 
entries;  and  (ii)  "error  budgets"  to  be  as- 
signed to  different  segments  of  mapping  and 
sequencing  processes  to  aid  in  developing 
the  most  cost-effective  strategies  for  se- 
quencing and  other  needs. 
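
One way to picture an "error budget" is sketched below: each stage of the process is assigned a per-base error probability, and the stage rates are combined to check whether the whole pipeline meets an overall target. The stage names and numbers are hypothetical, and the combination rule assumes that stages introduce errors independently.

```python
from math import prod


def combined_error_rate(stage_rates: dict[str, float]) -> float:
    """Probability that a base is corrupted by at least one stage,
    assuming each stage introduces errors independently."""
    return 1.0 - prod(1.0 - p for p in stage_rates.values())


def within_budget(stage_rates: dict[str, float], target: float) -> bool:
    """True if the combined per-base error rate meets the overall target."""
    return combined_error_rate(stage_rates) <= target


if __name__ == "__main__":
    # Hypothetical allocation; real values would come from measurement.
    budget = {
        "mapping": 2e-5,
        "base calling": 5e-5,
        "assembly": 2e-5,
    }
    total = combined_error_rate(budget)
    print(f"combined rate: {total:.2e}")
    print("meets 1-in-10,000 target:", within_budget(budget, 1e-4))
```

A table of this kind shows at a glance which stage consumes most of the allowance, and therefore where improvement would be most cost-effective.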

DOE  and  NIH  should  solicit  and  support 
detailed  Monte  Carlo  computer  simulation 
of  the  complete  mapping  and  sequencing 
processes.  The  basic  computing  methods  are 
straightforward:  a  reference  segment  of 
DNA (with all of the peculiarities of human
sequence)  is  generated  and  subjected  to 
models  of  all  steps  in  the  sequencing  pro- 
cess; individual  bases  are  randomly  altered 
according  to  errors  introduced  at  the  various 
stages;  and  the  final  reconstructed  segment 
or  simulated  database  entry  is  compared 
with  the  input  segment  and  errors  are  noted. 
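
A minimal sketch of such a simulation follows. It is a deliberate simplification, not a model of any real pipeline: the reference segment is uniformly random (with none of the peculiarities of human sequence), the only error type is an independent base substitution, each position is read a fixed number of times, and the reconstructed segment is a simple majority vote. All names and parameter values are assumptions made for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"


def make_reference(n_bases: int, rng: random.Random) -> str:
    """Generate a uniformly random reference segment (a simplification)."""
    return "".join(rng.choice(BASES) for _ in range(n_bases))


def noisy_read(reference: str, error_rate: float, rng: random.Random) -> str:
    """Copy the reference, substituting each base with probability error_rate."""
    out = []
    for base in reference:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads: list[str]) -> str:
    """Reconstruct the segment by majority vote at each position."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def simulate(n_bases: int = 2000, depth: int = 5,
             error_rate: float = 0.01, seed: int = 0) -> float:
    """Return the fraction of bases in the reconstructed segment that
    differ from the input segment."""
    rng = random.Random(seed)
    reference = make_reference(n_bases, rng)
    reads = [noisy_read(reference, error_rate, rng) for _ in range(depth)]
    called = consensus(reads)
    mismatches = sum(1 for a, b in zip(reference, called) if a != b)
    return mismatches / n_bases


if __name__ == "__main__":
    for depth in (1, 3, 5):
        print("depth", depth, "final error rate", simulate(depth=depth, error_rate=0.05))
```

Replacing the substitution model with stage-specific error models, and the random reference with realistic sequence, is where the input of technical experts would be required.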

Results  from  simulations  are  only  as 
good  as  the  models  used  for  introducing 
and  propagating  errors.  For  this  reason, 
the  computer  models  must  be  developed 
in  close  association  with  technical  experts 
in  all  phases  of  the  process  being  studied, 
so  that  they  best  reflect  the  real  world. 
This exercise will stimulate new experiments
to validate the error-process models
and thus will lead to increased experimental
understanding of process errors as well.


Improved  software  is  needed  to  enhance 
the  ability  of  database  centers  to  check  the 
quality of submitted sequence data before its
inclusion  in  the  database.  Many  of  the  cur- 
rent algorithms  are  highly  experimental  and 
will  be  improved  substantially  over  the  next 
5  years.  In  addition,  an  ongoing  software 
quality  assurance  program  should  be  consid- 
ered for  the  large  community  databases, 
with  advice  from  commercial  and  academic 
experts  on  software  engineering  and  quality 
control.  It  is  appropriate  for  the  HGP  to  in- 
sist on  a  consistent  level  of  documentation, 
both  in  the  published  literature  and  in  user 
manuals,  of  the  methods  and  structures  used 
in  the  database  centers  that  it  supports. 
DOE  and  NIH  should  also  decide  on  stan- 
dards for  the  inclusion  of  quality  metrics  for 
base  identification  and  DNA  assembly  along 
with  every  database  entry  submitted. 

Informatics.  Genome  informatics  is  a 
child  of  the  information  age,  a  status  that 
brings  clear  advantages  and  new  hurdles. 
Managing  such  a  diverse,  large-scale,  rapidly 
moving informatics effort is a considerable
challenge  for  both  DOE  and  NIH.  The  in- 
frastructure supporting  the  requisite  soft- 
ware tools  ranges  from  small  research 
groups  (for  example,  for  local  special-pur- 
pose databases)  to  large  Genome  Centers 
(for  process  management  and  robotic  con- 
trol systems)  to  community  database  centers 
(for  GenBank  and  the  Genome  Database). 
The  resources  that  each  of  these  groups  can 
put  into  increasing  software  sophistication, 
into ensuring ease of use, and into quality
control  vary  widely.  Thus,  in  informatics  ar- 
eas requiring  new  research  (such  as  gene 
finding),  a  broad-based  approach  of  "letting 
a thousand flowers bloom" is most appropriate.
At the other end of the spectrum, DOE
and  NIH  must  impose  community-wide 
standards for software consistency and quality
in areas of informatics in which a large
user community will be accessing major
genome databases.

DOE and NIH should adhere to a bottom-up,
customer approach to informatics.
Part  of  this  process  would  be  to  encourage 
forums,  including  close  collaborative  pro- 
grams, between  the  users  and  providers  of 
informatics  tools,  with  the  purposes  of  de- 
termining what  tools  are  needed  and  of 
training  researchers  in  the  use  of  new 
methods. 

To ensure that all the database centers are
user-oriented and that they are providing services
that are genuinely useful to the genome
community, each database center should be
required to establish its own "users group" (as
is done by facilities as diverse as the National
Science Foundation's Supercomputer Centers
and NASA's Hubble Space Telescope).
Further, informatics centers must be critically
evaluated as to the actual use of their
information and services by the
community.

Data  formats,  software  components,  and 
nomenclature  should  be  standardized  across 
the  community.  If  multiple  formats  exist,  it 
would  be  worthwhile  to  invest  in  systems 
that  can  translate  among  them.  Data 
archiving,  data  retrieval,  and  data  manipu- 
lation should  be  modularized  so  that  one  da- 
tabase is  not  overextended,  and  several 
groups  should  be  involved  in  the  develop- 
ment effort.  The  community  should  be  sup- 
porting several  database  efforts  and  promot- 
ing standardized  interfaces  and  tools  among 
those  efforts. 

Final notes. The HGP involves technology
development, production sequencing,
and sequence utilization. Greater coupling
of these three areas can only improve the
project. Technology development should be
coordinated with the needs and problems of
production sequencing, whereas sequence
generation and informatics tools must address
the needs of data users. Promotion of
such coupling is an important role for the
funding agencies.

The  HGP  presents  an  unprecedented  set 
of  organizational  challenges  for  the  biology 
community.  Success  will  require  setting  ob- 
jective and  quantitative  standards  for  se- 
quencing costs  (capital,  labor,  and  opera- 
tions) and  sequencing  output  (error  rate, 
continuity,  and  amount).  It  will  also  require 
coordinating the efforts of many laboratories
of varying sizes supported by multiple
funding sources in the United States and
abroad.

A  number  of  diverse  scientific  fields 
have  successfully  adapted  to  a  "big  science" 
mode  of  operation  (nuclear  and  particle 
physics,  space  and  planetary  science,  as- 
tronomy, and  oceanography  are  among  the 
prominent  examples).  Such  transitions 
have  not  been  easy  on  the  scientists  in- 
volved. However,  in  essentially  all  of  these 
cases,  the  need  to  construct  and  allocate 
scarce facilities has been an important organizing
factor. No such centralizing force
is  apparent  in  the  genomics  community, 
but  the  HGP  is  very  much  in  need  of  the 
coordination  it  would  produce. 

MAJOR EVENTS IN THE U.S. HUMAN GENOME PROJECT AND RELATED PROGRAMS


1983 

LANL  and  LLNL  begin 
production  of  DNA  clone 
(cosmid)  libraries 
representing  single 
chromosomes. 

1984 

DOE  OHER  and  ICPEMC 
cosponsor  Alta,  Utah, 
conference  highlighting 
the  growing  role  of 
recombinant  DNA 
technologies.  OTA 
incorporates  Alta 
proceedings  into  a  1986 
report  acknowledging 
value  of  human  genome 
reference  sequence. 


DOE  advisory  committee, 
HERAC,  recommends  a 
15-year, multidisciplinary,
scientific,  and  technological 
undertaking  to  map  and 
sequence  the  human 
genome.  DOE  designates 
multidisciplinary  human 
genome  centers. 

* NIH NIGMS begins funding
of  genome  projects. 


*  Robert  Sinsheimer  holds 
meeting  on  human 
genome  sequencing  at 
University  of  California, 
Santa Cruz.

At  OHER,  Charles  DeLisi 
and  David  A.  Smith 
commission  the  first  Santa 
Fe  conference  to  assess  the 
feasibility  of  a  Human 
Genome  Initiative. 


1986 


Following  the  Santa  Fe 
conference,  DOE  OHER 
announces  Human 
Genome  Initiative.  With 
$5.3  million,  pilot  projects 
begin  at  DOE  national 
laboratories  to  develop 
critical  resources  and 
technologies. 


*  Reports  by  OTA  and  NAS 
NRC  recommend  concerted 
genome  research  program. 

HUGO  founded  by  scientists 
to  coordinate  efforts 
internationally. 

*  First  annual  Cold  Spring 
Harbor  Laboratory  meeting 
held  on  human  genome 
mapping  and  sequencing. 

DOE  and  NIH  sign  MOU 
outlining  plans  for 
cooperation  on  genome 
research. 

Telomere  (chromosome 
end)  sequence  having 
implications for aging and
cancer research is identified
at LANL.


1989 


DNA  STSs  recommended 
to correlate diverse types of
DNA  clones. 

DOE  and  NIH  establish 
joint  ELSI  Working  Group. 


1990 

DOE  and  NIH  present  joint 
5-year  U.S.  HGP  plan  to 
Congress. The 15-year
project  formally  begins. 

Projects  begun  to  mark 
genes  on  chromosome 
maps  as  sites  of  mRNA 
expression. 

R&D  begun  for  efficient 
production  of  more  stable, 
large-insert  BACs. 


1991 


Human  chromosome 
mapping  data  repository, 
GDB, established.


1992 


*  Low-resolution  genetic 
linkage  map  of  entire 
human  genome  published. 

Guidelines  for  data  release 
and  resource  sharing 
announced  by  DOE 
and  NIH. 


1993 

International  IMAGE 
Consortium  established  to 
coordinate  efficient 
mapping  and  sequencing  of 
gene-representing  cDNAs. 

DOE-NIH  joint  ELSI  Working 
Group's  Task  Force  on 
Genetic  Information  and 
Insurance  releases 
recommendations. 

DOE  and  NIH  revise  5-year 
goals  [Science  262,  43-46 
(Oct.  1,1993)]. 

* French Généthon provides
mega-YACs  to  the  genome 
community. 

IOM releases U.S. HGP-funded
report, "Assessing
Genetic Risks."

GRAIL  sequence 
interpretation  service  with 
Internet  access  initiated  at 
ORNL. 


ADA Americans with Disabilities Act
ANL Argonne National Laboratory
BAC bacterial artificial chromosome
cDNA complementary deoxyribonucleic acid
CGAP Cancer Genome Anatomy Project
DNA deoxyribonucleic acid
DHHS Department of Health and Human Services (NIH)
DOE Department of Energy
EEOC Equal Employment Opportunity Commission
ELSI ethical, legal, and social issues
GDB Genome Database
GRAIL Gene Recognition and Analysis Internet Link
HERAC Health and Environmental Research Advisory Committee
HGP Human Genome Project, Human Genome Program
HUGO Human Genome Organisation
ICPEMC International Commission for Protection Against Environmental Mutagens and Carcinogens
IMAGE Integrated Molecular Analysis of Gene Expression
IOM Institute of Medicine (NAS)

1994


*  Genetic-mapping  5-year 
goal  achieved  1  year  ahead 
of  schedule. 

Completion of second-generation
DNA clone
libraries representing each
human chromosome by
LLNL and LBNL.

Genetic Privacy Act, first U.S.
HGP legislative product,
proposed to regulate
collection, analysis, storage,
and use of DNA samples
and genetic information
obtained from them;
endorsed by DOE-NIH joint
ELSI Working Group.

DOE  Microbial  Genome 
Program  launched;  spin-off 
of HGP.

LLNL  chromosome  paints 
commercialized. 

SBH  technologies  from  ANL 
commercialized. 

DOE  HGP  Information  Web 
site  activated  for  public  and 
researchers. 


1995 

LANL  and  LLNL  announce 
high-resolution  physical 
maps of chromosome 16
and chromosome 19,
respectively.

*  Moderate-resolution  maps 
of  chromosomes  3,  11,  12, 
and 22 published.

*  First  (nonviral)  whole 
genome  sequenced  (for  the 
bacterium  Haemophilus 
influenzae). 

Sequence  of  smallest 
bacterium, Mycoplasma
genitalium,  completed, 
displaying  the  minimum 
number  of  genes  needed 
for  independent  existence. 

*  EEOC  guidelines  extend 
ADA  employment 
protection  to  cover 
discrimination  based  on 
genetic  information  related 
to  illness,  disease,  or  other 
conditions.


LANL Los Alamos National Laboratory
LBNL Lawrence Berkeley National Laboratory
LLNL Lawrence Livermore National Laboratory
MGP Microbial Genome Project
MOU Memorandum of Understanding
mRNA messenger ribonucleic acid
NAS National Academy of Sciences
NCHGR National Center for Human Genome Research (NIH)
NCI National Cancer Institute (NIH)
NHGRI National Human Genome Research Institute (NIH)
NIGMS National Institute of General Medical Sciences (NIH)
NIH National Institutes of Health
NRC National Research Council
OHER Office of Health and Environmental Research
ORNL Oak Ridge National Laboratory
OTA Office of Technology Assessment
R&D Research and Development
SBH sequencing by hybridization
STS sequence tagged site
YAC yeast artificial chromosome


Methanococcus  jannaschii 
genome  sequenced; 
confirms  existence  of  third 
major  branch  of  life,  the 
Archaea. 

DOE-NIH  Task  Force  on 
Genetic  Testing  releases 
interim  principles. 

*  Integrated  STS-based 
detailed  human  physical 
map  with  30,000  STSs 
achieves  an  HGP  goal. 

*  Health  Care  Portability  and 
Accountability Act
prohibits  use  of  genetic 
information  in  certain 
health-insurance  eligibility 
decisions,  requires  DHHS 
to  enforce  health- 
information  privacy 
provisions. 

DOE-NIH joint ELSI
Working  Group  releases 
guidelines  on  informed 
consent  for  large-scale 
sequencing  projects. 

DOE  and  NCHGR  issue 
guidelines  on  use  of 
human  subjects  for  large- 
scale  sequencing  projects. 

* Saccharomyces cerevisiae
(yeast)  genome  sequence 
completed  by 
international  consortium. 

Sequence  of  the  human 
T-cell  receptor  region 
completed. 

Wellcome  Trust  sponsors 
large-scale  sequencing 
strategy  meeting  in 
Bermuda  for  international 
coordination  of  human 
genome  sequencing. 


1997 

DOE forms Joint Genome
Institute  for  implementing 
high-throughput 
sequencing  at  DOE  HGP 
centers. 

*  NIH  NCHGR  becomes 
NHGRI.

*  Escherichia  coli  genome 
sequence  completed. 

Second  large-scale 
sequencing  strategy 
meeting  held  in  Bermuda. 

*  High-resolution  physical 
maps  of  chromosomes  X 
and  7  completed. 

Methanobacterium
thermoautotrophicum 
genome  sequence 
completed. 

Archaeoglobus  fulgidus 
genome  sequence 
completed. 

* NCI CGAP begins.


*  DOE  had  limited  or  no 
involvement  in  this  event. 

Preface


More than a decade ago, the Office of Health and Environmental Research (OHER) of the U.S. Department
of Energy (DOE) struck a bold course in launching its Human Genome Initiative, convinced that
its mission would be well served by a comprehensive picture of the human genome. Organizers recognized
that the information the project would generate, both technological and genetic, would contribute
not only to a new understanding of human biology and the effects of energy technologies but
also to a host of practical applications in the biotechnology industry and in the arenas of agriculture and environmental
protection.

Today,  the  project's  value  appears  beyond  doubt  as  worldwide  participation  contributes  toward  the  goals  of  determining 
the  human  genome's  complete  sequence  by  2005  and  elucidating  the  genome  structure  of  several  model  organisms  as 
well. This report summarizes the content and progress of the DOE Human Genome Program (HGP). Descriptive
research summaries, along with information on program history, goals, management, and current research highlights,
provide a comprehensive view of the DOE program.

Last  year  marked  an  early  transition  to  the  third  and  final  phase  of  the  U.S.  Human  Genome  Project  as  pilot  programs  to 
refine  large-scale  sequencing  strategies  and  resources  were  funded  by  DOE  and  the  National  Institutes  of  Health,  the  two 
sponsoring  U.S.  agencies.  The  human  genome  centers  at  Lawrence  Berkeley  National  Laboratory,  Lawrence  Livermore 
National  Laboratory,  and  Los  Alamos  National  Laboratory  had  been  serving  as  the  core  of  DOE  multidisciplinary  HGP 
research, which requires extensive contributions from biologists, engineers, chemists, computer scientists, and mathematicians.
These team efforts were complemented by those at other DOE-supported laboratories and about 60 universities,
research  organizations,  companies,  and  foreign  institutions.  Now,  to  focus  DOE's  considerable  resources  on  meeting  the 
challenges  of  large-scale  sequencing,  the  sequencing  efforts  of  the  three  genome  centers  have  been  integrated  into  the 
Joint  Genome  Institute.  The  institute  will  continue  to  bring  together  research  from  other  DOE-supported  laboratories. 
Work in other critical areas continues to develop the resources and technologies needed for production sequencing;
computational approaches to data management and interpretation (called informatics); and an exploration of the important
ethical,  legal,  and  social  issues  arising  from  use  of  the  generated  data,  particularly  regarding  the  privacy  and  confidenti- 
ality of  genetic  information. 

Insights, technologies, and infrastructure emerging from the Human Genome Project are catalyzing a biological revolution.
Health-related biotechnology is already a success story and is still far from reaching its potential. Other applications
are likely to beget similar successes in coming decades; among these are several of great importance to DOE.
We can look to improvements in waste control and an exciting era of environmental bioremediation, we will see new
approaches to improving energy efficiency, and we can hope for dramatic strides toward meeting the fuel demands of
the future.

In 1997 OHER, renamed the Office of Biological and Environmental Research (OBER), is celebrating 50 years of conducting
research to exploit the boundless promise of energy technologies while exploring their consequences to the
public's  health  and  the  environment.  The  DOE  Human  Genome  Program  and  a  related  spin-off  project,  the  Microbial 
Genome  Program,  are  major  components  of  the  Biological  and  Environmental  Research  Program  of  OBER. 

DOE  OBER  is  proud  of  its  contributions  to  the  Human  Genome  Project  and  welcomes  general  or  scientific  inquiries 
concerning its genome programs. Announcements soliciting research applications appear in Federal Register, Science,
Human Genome News, and other publications. The deadline for formal applications is generally midsummer for awards
to be made the next year, and submission of preproposals in areas of potential interest is strongly encouraged. Further
information may be obtained by contacting the program office or visiting the DOE home page (301/903-6488,
Fax: -8521, genome@oer.doe.gov, URL: http://www.er.doe.gov/production/ober/hug_top.html).


Aristides Patrinos, Associate Director

Office  of  Biological  and  Environmental  Research 

U.S.  Department  of  Energy 

November 3, 1997

Introduction


Now  completing  its  first  de- 
cade, the  Human  Genome 
Program  of  the  U.S.  De- 
partment of  Energy  (DOE) 
is  the  longest-running 
federally  funded  program  to  analyze  the 
genetic material (the genome) that determines
an individual's characteristics
at the most fundamental level. Part of
the  Biological  and 
Environmental  Re- 
search (BER) 
Program  spon- 
sored by  the 
DOE  Office  of 
Biological  and 
Environmental 
Research 
(OBER*),  the 
genome  program 
is  a  major  com- 
ponent of  the 
larger  U.S.  Hu- 
man Genome 
Project. 

Since  October  1990,  the 
project  has  been  supported  jointly  by 
DOE  and  the  National  Institutes  of 
Health  (NIH)  National  Human  Genome 
Research  Institute  (formerly  National 
Center  for  Human  Genome  Research). 
Together,  the  DOE  and  NIH  components 
make  up  the  world's  largest  centrally  co- 
ordinated biology  research  project  ever 
undertaken. 

The  U.S.  Human  Genome  Project  is  a 
15-year  endeavor  to  characterize  the  hu- 
man genome  by  improving  existing  hu- 
man genetic  maps,  constructing  physical 
maps  of  entire  chromosomes,  and  ulti- 
mately determining  a  complete  sequence 
of  the  deoxyribonucleic  acid  (DNA) 
subunits.  Parallel  studies  are  being  car- 
ried out  on  selected  model  organisms  to 
facilitate  interpretation  of  human  gene 
function. 


The  ultimate  goal  of  the  U.S.  project  is 
to identify the estimated 70,000 to
100,000  human  genes  and  render  them 
accessible  for  future  biological  study.  A 
complete  human  DNA  sequence  will 
provide  physicians  and  researchers  in 
many  biological  disciplines  with  an  ex- 
traordinary resource:  an  "encyclopedia" 
of  human  biology  obtainable  by  com- 
puter and  available 
to  all. 


genome (je'nom), n.
all the genetic material
in the chromosomes of
an organism.


Scientific and technical terms are
defined in the Glossary, p. 101. More
historical details and other information
appear in the Appendices beginning on
p. 71.


For 50 years, programs in the DOE Office of
Biological and Environmental Research have crossed
traditional research boundaries in seeking new
solutions to energy-related biological and
environmental challenges (see Appendix F, p. 9fi, and
http://www.er.doe.gov/production/ober/ober.html).


*In 1997 the Office of Health and Environmental
Research (OHER) was renamed
Office of Biological and Environmental
Research (OBER).


Obtaining the
complete sequence
by 2005
will require a
highly coordinated
and focused international
effort generating
advances in biological methodology;
instrumentation (particularly automation);
and computer-based methods for
collecting, storing, managing, and analyzing
the rapidly growing body of data.

Project  Origins 

The  potential  value  of  detailed  genetic 
information  was  recognized  early;  until 
recently, however, obtaining this information
was far beyond the capabilities of
biomedical  research.  DOE  OBER  and  its 
two  predecessor  agencies — the  Atomic 
Energy  Commission  and  the  Energy  Re- 
search and  Development  Administra- 
tion— had  long  sponsored  genetic 
research  in  both  microbial  and  higher 
systems.  These  studies  included  explora- 
tions into  population  genetics;  genome 
structure, maintenance, replication, damage,
and repair; and the consequences of
genetic  mutations.  These  traditional  DOE 
activities  evolved  naturally  into  the  Hu- 
man Genome  Program. 

OBER's mission is described
more fully in the Program
Management section (p. 59)
of this report.


By 1985, progress in genetic and DNA
technologies  led  to  serious  discussions 
in  the  scientific  community  about  initi- 
ating a  major  project  to  analyze  the 
structure  of  the  human  genome.  After 
concluding  that  a  DNA  sequence  would 
offer  the  most  useful  approach  for  de- 
tecting inherited  mutations,  DOE  in 
1986  announced  its  Human  Genome 
Initiative.  The  initiative  emphasized  de- 
velopment of  resources  and  technolo- 
gies for  genome  mapping,  sequencing, 
computation, and infrastructure support
that  would  culminate  in  a  complete  se- 
quence of  the  human  genome. 

The  National  Research  Council  issued  a 
report in 1988 recommending a dedicated
research budget of $200 million
annually  for  15  years  to  determine  the 
sequence  of  the  3  billion  chemical  sub- 
units  (base  pairs)  in  the  human  genome 
and  to  map  and  identify  all  human  genes. 

To  launch  the  nation's  Human  Genome 
Project,  Congress  appropriated  funds  to 


DOE and also to NIH, which had long
supported  research  in  genetics  and  mo- 
lecular biology  as  an  integral  part  of  its 
mission  to  improve  the  health  of  all 
Americans.  Other  federal  agencies  and 
foundations  outside  the  Human  Genome 
Project  also  contribute  to  genome  re- 
search, and  many  other  countries  are 
making  important  contributions  through 
their  own  genome  research  projects. 

Coordinated  Efforts 

In  1988  DOE  and  NIH  signed  a  Memo- 
randum of  Understanding  in  which  the 
agencies  agreed  to  work  together,  coordi- 
nate technical  research  and  activities,  and 
share  results.  The  two  agencies  assumed 
a  joint  systematic  approach  toward  estab- 
lishing goals  to  satisfy  both  short-  and 
long-term  project  needs. 

Early  guidelines  projected  three  5-year 
phases, for which the first plan was presented
to Congress in 1990. The 1990


Anticipated  Benefits  of  Genome  Research 


Predictions of biology as "the science
of the 21st century" have been made
by observers as diverse as Microsoft's
Bill Gates and U.S. President Bill
Clinton. Already revolutionizing biology,
genome research has spawned a
burgeoning biotechnology industry
and is providing a vital thrust to the
increasing productivity and pervasiveness
of the life sciences.

Technology and resources promoted
by the Human Genome Project already
have had profound impacts on
biomedical research and promise to
revolutionize biological research and
clinical medicine. Increasingly detailed
genome maps have aided researchers
seeking genes associated
with dozens of genetic conditions, including
myotonic dystrophy, fragile X
syndrome, neurofibromatosis types 1
and 2, a kind of inherited colon cancer,
Alzheimer's disease, and familial breast
cancer.

Current  and  potential  applications  of 
genome  research  will  address  national 
needs  in  molecular  medicine,  waste 
control  and  environmental  cleanup, 
biotechnology,  energy  sources,  and  risk 
assessment. 

Molecular  Medicine 

On the horizon is a new era of molecular
medicine characterized less by treating
symptoms and more by looking to
the most fundamental causes of disease.
Rapid  and  more  specific  diagnostic  tests 
will  make  possible  earlier  treatment  of 
countless  maladies.  Medical  researchers 


also will be able to devise novel therapeutic
regimens based on new classes of
drugs, immunotherapy techniques, avoidance
of environmental conditions that
may trigger disease, and possible augmentation
or even replacement of
defective genes through gene therapy.

Microbial  Genomes 

In 1994, taking advantage of new capabilities
developed by the genome project,
DOE formulated the Microbial Genome
Initiative to sequence the genomes of
bacteria useful in the areas of energy production,
environmental remediation, toxic
waste reduction, and industrial processing.
In the resulting Microbial Genome
Project, six microbes that live under extreme
conditions of temperature and pressure
have been sequenced completely as

plan emphasized the creation of chromosome
maps, software, and automated
technologies  to  enable  sequencing. 

By 1993, unexpectedly rapid progress in
chromosome mapping required updating
the goals [Science 262, 43-46 (October
1, 1993)], which now project through
1998 (see p. 5). This plan is being revised
again in anticipation of the approaching
high-throughput sequencing
phase of the project. Last year marked an
early transition to this phase as many
more genome sequencing projects were
funded. The second and third phases of
the project will optimize resources, refine
sequencing strategies, and, finally,
completely determine the sequence of all
base pairs in the genome.

Another area of DOE and NIH cooperation
is in exploring the ethical, legal, and
social issues (ELSI) arising from increased
availability of genetic data and
growing genetic-testing capabilities. The


two  agencies  established  a  joint  work- 
ing group  to  confront  these  ELSI  chal- 
lenges and  have  cosponsored  joint 
projects and workshops.

DOE  Genome  Program 

A  general  overview  follows  of  recent 
progress  made  in  the  DOE  Human  Ge- 
nome Program.  Refer  to  the  timeline 
(pp.  ii-iii)  for  other  achievements  to- 
ward U.S.  goals,  including  contribu- 
tions made  outside  DOE. 

Physical  maps 

For DOE, an early goal was to develop
chromosome  physical  maps,  which  in- 
volves reconstructing  the  order  of  cloned 
DNA  fragments  to  represent  their  spe- 
cific originating  chromosomes.  (A  set  of 
such  cloned  fragments  is  called  a  library.) 
Critical  to  this  effort  were  the  libraries 
of  individual  human  chromosomes 




of August 1997. Structural studies are
under way to learn what is unique
about the proteins of these organisms,
the ultimate aim being to use the microbes
and their enzymes for such
practical purposes as waste control
and environmental cleanup.

Biotechnology 

The  potential  for  commercial  develop- 
ment presents  U.S.  industry  with  a 
wealth of opportunities. Sales of biotechnology
products are projected to
exceed $20 billion by the year 2000.
The genome project already has
stimulated significant investment by
large corporations and prompted the
creation of new biotechnology companies
hoping to capitalize on the far-reaching
implications of its research.


Energy  Sources 

Biotechnology, fueled by insights reaped
from the genome project, will play a significant
role in improving the use of fossil-based
resources. Increased energy
demands, projected over the next
50 years, require strategies to circumvent
the many problems associated with
today's dominant energy technologies.
Biotechnology promises to help address
these needs by providing cleaner means
for the bioconversion of raw materials to
refined products. In addition, there is the
possibility of developing entirely new
biomass-based energy sources. Having
the genomic sequence of the methane-producing
microorganism Methanococcus
jannaschii, for example, will enable
researchers to explore the process of
methanogenesis in more detail and could
lead to cheaper production of fuel-grade
methane.

Risk  Assessment 

Understanding the human genome
will have an enormous impact on the
ability to assess risks posed to individuals
by environmental exposure to
toxic agents. Scientists know that genetic
differences make some people
more susceptible, and others more
resistant, to such agents. Far more
work must be done to determine the
genetic basis of such variability. This
knowledge will directly address
DOE's long-term mission to understand
the effects of low-level
exposures to radiation and other
energy-related agents, especially in
terms of cancer risk.
produced at Los Alamos National Laboratory
(LANL) and Lawrence Livermore
National Laboratory (LLNL). These libraries
allowed the huge task of mapping and
sequencing the entire 3 billion bases in
the human genome to be broken down into
24 much smaller single-chromosome
units. Availability of the libraries has enabled
the participation of many laboratories
worldwide. Some three generations
of clone libraries with improving characteristics
have been produced and widely
distributed. In the DOE-supported projects,
DNA clones representing chromosomes
16, 19, and 22 have been ordered
(mapped) and are now providing material
needed for large-scale sequencing.

Sequencing 

Toward the goal of greatly increasing the
speed and decreasing the cost of DNA
sequencing, DOE has supported improvements
in standard technologies and
has pioneered support for revolutionary
sequencing systems. Marked improvements
have been made in reagents, enzymes,
and raw data quality. Such novel
approaches as sequencing by hybridization
(using DNA "chips") and mass spectrometry
have already found important,
previously unanticipated applications
outside the Human Genome Project.

Joint  Genome  Institute 

In early 1997, the human genome centers
at Lawrence Berkeley National Laboratory,
LANL, and LLNL began collaborating
in the Joint Genome Institute
(JGI), within which high-throughput
sequencing will be implemented [see
p. 26 and Human Genome News 8(2),
1-2]. The initial JGI focus will be on sequencing
areas of high biological interest
on several chromosomes, including human
chromosomes 5, 16, and 19. Establishment
of JGI represents a major
transition in the DOE Human Genome
Program.

Previously, most goals were pursued by
small- to medium-sized teams, with
modest multisite collaborations. The JGI
will house high-throughput implementations
of successful technologies that
will be run with increasingly stringent
process- and quality-control systems.

In addition, a small component aimed at
understanding how genes function in the
body (a field known as functional
genomics) has been established and will
grow as sequencing targets are met.
High-throughput functional genomics
represents a new era in human biology,
one which will have profound implications
for solving biological problems.

Informatics 

In preparation for the production-sequencing
phase, many algorithms for
interpreting DNA sequence have been
developed, and an increasing number
have become available as services over
the Internet. Last year, the GRAIL (for
Gene Recognition and Analysis Internet
Link) and GenQuest servers, developed
and maintained at Oak Ridge National
Laboratory, processed an average of
almost 40 million bases of sequence
each month.

As technology improves and data accumulate
exponentially, continued progress
in the Human Genome Project will depend
increasingly on the development of
sophisticated computational tools and
resources to manage and interpret the
information. The ease with which researchers
can access and use the data
will provide a measure of the project's
success. Critical to this success is the
creation of interoperable databases and
other computing and informatics tools to
collect, organize, and interpret thousands
of DNA clones.

For additional information on the DOE
genome programs, refer to Research
Highlights, p. 9; Research Narratives,
p. 25; this report's Part 2, 1996 Research
Abstracts; and the Web site
(http://www.ornl.gov/hgmis).
Five-Year Research Goals
of the U.S. Human Genome Project

October 1, 1993, to September 30, 1998 (FY 1994 through FY 1998)*


Major events in the U.S. Human Genome Project, including progress made toward these
goals, are charted in a timeline on pp. ii-iii.


Genetic  Mapping 

• Complete the 2- to 5-cM map by 1995.

• Develop technology for rapid
genotyping.

• Develop markers that are easier to
use.

• Develop new mapping technologies.

Physical Mapping

• Complete a sequence tagged site
(STS) map of the human genome at
a resolution of 100 kb.


DNA Sequencing

• Develop efficient approaches to sequencing
one- to several-megabase
regions of DNA of high biological
interest.

• Develop technology for high-throughput
sequencing, focusing on
systems integration of all steps from
template preparation to data analysis.

• Build up a sequencing capacity to
allow sequencing at a collective rate
of 50 Mb per year by the end of the
period. This rate should result in an
aggregate of 80 Mb of DNA sequence
completed by the end of FY 1998.

Gene Identification

• Develop efficient methods for identifying
genes and for placement of known
genes on physical maps of sequenced
DNA.


Technology  Development 

• Substantially expand support of innovative
technological developments
as well as improvements in
current technology for DNA sequencing
and for meeting the needs
of the Human Genome Project as a
whole.

Model Organisms

• Finish an STS map of the mouse
genome at a 300-kb resolution.

• Finish the sequence of the Escherichia
coli and Saccharomyces cerevisiae
genomes by 1998 or earlier.

• Continue sequencing Caenorhabditis
elegans and Drosophila
melanogaster genomes with the aim
of bringing C. elegans to near
completion by 1998.

• Sequence selected segments of
mouse DNA side by side with corresponding
human DNA in areas of
high biological interest.

Informatics 

• Continue to create, develop, and
operate databases and database
tools for easy access to data, including
effective tools and standards for
data exchange and links among
databases.

• Consolidate, distribute, and continue
to develop effective software for
large-scale genome projects.

• Continue to develop tools for comparing
and interpreting genome
information.


Ethical, Legal, and Social
Implications

• Continue to identify and define
issues and develop policy options
to address them.

• Develop and disseminate policy
options regarding genetic testing
services with potential widespread
use.

• Foster greater acceptance of
human genetic variation.

• Enhance and expand public and
professional education that is
sensitive to sociocultural and
psychological issues.

Training 

• Continue to encourage training
of scientists in interdisciplinary
sciences related to genome
research.

Technology  Transfer 

• Encourage and enhance technology
transfer both into and out of
centers of genome research.

Outreach 

• Cooperate with those who would
establish distribution centers for
genome materials.

• Share all information and materials
within 6 months of their
development. This should be
accomplished by submission of
information to public databases
or repositories, or both, where
appropriate.


*Original 1990 goals were revised in 1993 due to rapid progress. A second revision was being developed at press time.
Evolution of a Vision:

Genome  Project  Origins, 


In an interview at a DNA sequencing conference in Hilton Head,
South Carolina,* David Smith, a founder and former Director of the
DOE Human Genome Program, recalled the establishment of this
country's first human genome project. The impressive early achievements
and spin-off benefits, he noted, offer more than mere vindication
for project founders. They also provide a tantalizing glimpse
into the future where, he observed, "scientists will be empowered to
study biology and make connections in ways undreamt of before."


The DOE Human Genome Program
began as a natural outgrowth
of the agency's
long-term mission to develop
better technologies for measuring
health effects, particularly induced mutations.
As Smith explained it, "DOE had
been supporting mutation studies in Japan,
where no heritable mutations could be detected
in the offspring of populations exposed
to the atomic blasts at Hiroshima and
Nagasaki. The program really grew out of a
need to characterize DNA differences between
parents and children more efficiently.
DOE led the development of many mutation
tests, and we were interested in developing
even more sensitive detection
methods. Mortimer Mendelsohn of
Lawrence Livermore National Laboratory,
a member of the International Commission
for Protection Against Environmental
Mutagens and Carcinogens, and I decided
to hold a workshop to discuss DNA-based
methods (see Human Genome Project
chronology, p. ii).

"Ray White (University of Utah) organized
the meeting, which took place in Alta,
Utah, in December 1984. It was a small
meeting but very stimulating intellectually.
We concluded the obvious: that if you really
wanted to use DNA-based technologies,
you had to come up with more
efficient ways to characterize the DNA of
much larger regions of the genome. And the
ultimate sensitivity would be the capability
to compare the complete DNA sequences
of parents and their offspring."


Project  Begins 


Smith recalled reaction to the first public
statement that DOE was starting a program
with the aim of sequencing the human genome.
"I announced it at the Cold Spring


view. "In fact, individual investigators can
do things they would never be able to do
otherwise. We're beginning to see that
demonstrated at this meeting. For the first
time, we're finding people exploring systematic
ways of looking at gene function in
organisms. The genome project opens up
enormous new research fields to be mined.
Cottage-industry biologists won't need a lot
of robots, but they will have to be computer
literate to put the information all together."

The genome project also is providing enabling
technologies essential to the future
of the emerging biotechnology industry,
catalyzing its tremendous growth. According
to Smith, the technologies are


"Genomics has come of age, and it is
opening the door to entirely new
approaches to biology."


*The Seventh International Genome Sequencing
and Analysis Conference, September 1995
Harbor meeting in May 1986, and there was
a big hullabaloo." After a year-long review,
a National Academy of Sciences National
Research Council panel endorsed the
project and the basic strategy proposed.
Smith pointed out that NIH and others were
also having discussions on the feasibility of
sequencing the human genome. "Once NIH
got interested, many more people became
involved. DOE and NIH signed a Memorandum
of Understanding in October 1988
to coordinate our activities aimed at characterizing
the human genome." But, he observed,
it wasn't all smooth sailing. The
nascent project had many detractors.

Responding  to  Critics 

Many scientists, prominent biologists
among them, thought having the sequence
would be a misuse of scarce resources.
Smith, laughing now, recalls one scientist
complaining, "Even if I had the sequence,
I wouldn't know what to do with it." Other
critics worried that the genome project
would siphon shrinking research funds
away from individual investigator-initiated
research projects. Smith takes the opposite


capable of more than elucidating the human
genome. "We're developing an infrastructure
for future research. These technologies
will allow us to efficiently characterize any
of the organisms out there that pertain to
various DOE missions, with such applications
as better fuels from biomass,
bioremediation, and waste control. They
also will lead to a greater understanding of
global cycles, such as the carbon cycle, and
the identification of potential biological interventions.
Look at the ocean; an amazing
number of microbes are in there, but we
don't know how to use them to influence
cycles to control some of the harmful
things that might be happening. Up to now,
biotechnology has been nearly all health
oriented, but applications of genome research
to modern biology really go beyond
health. That's one of the things motivating
our program to try to develop some of these
other biotechnological applications."

Responding to criticism about not researching
gene function early in the project,
Smith reasserted that the purpose of the
Human Genome Project is to build technologies
and resources that will enable researchers
to learn about biology in a much
Present and Future Challenges,

Far-Reaching  Benefits 


more efficient way. "The genome budget is
devoted to very specific goals, and we
make sure that projects contribute toward
reaching them."

International  Scope 

Smith credited the international community
with contributing to many project successes.
"The initial planning was for a U.S.
project, but the outcome, of course, is that
it is truly international, and we would not
be nearly as far as we are today without
those contributions. Also, there's been a fair
amount of money from private companies,
and support from the Muscular Dystrophy
Association in France and The Wellcome
Trust in the United Kingdom has been extremely
important."

Technology  Advances 

While noting enormous advances across the
board, Smith cited automation progress and
observed that tremendously powerful robots
and automated processes are changing
the way molecular biology is done. "A lot
of novel technologies probably won't be
useful for initial sequencing but will be
very valuable for comparing sequences of
different people and for polymorphism
studies. One of the most gratifying recent
successes is the DNA polymerase engineering
project. Researchers made a fairly
simple change, but it resulted in a
thermosequenase that may answer a lot of
problems, reduce the cost of sequencing,
and give us better data."

Progress in genome research requires the
use of maturing technologies in other
fields. "The combination of technologies
that are coming together has been fortuitous;
for example, advances in informatics
and data-handling technologies have had a
tremendous impact on the genome project.
We would be in deep trouble if they were at
a less-mature stage of development. They
have been an important DOE focus."


ELSI

Smith described tangible progress toward
goals associated with programs on the ethical,
legal, and social issues (ELSI) related
to data produced by the genome project.
"ELSI programs have done a lot to educate
the thinkers, and this has produced a higher
level of discourse in the country about
these issues. DOE is spending a large fraction
of its ELSI money on informing special
populations who can reach others.
Educating judges has been especially well
received because they realize the potential
impact of DNA technology on the courts."

According to Smith, more people and
groups need to be involved in ELSI matters.
"We have some ELSI products: the
DOE-NIH Joint ELSI Working Group has
an insurance task force report, and a DOE
ELSI grantee has produced draft privacy
legislation. Now it's time for others to
come and translate ELSI efforts into policy.
Perhaps the new National Bioethics Advisory
Commission can do some of this."

New  Model  for  Biological 
Research 

Smith spoke of a changing paradigm guiding
DOE-supported biology. "Some years
ago, the central idea or dogma in molecular
biology research was that information in
DNA directs RNA, and RNA directs proteins.
Today, I think there is a new paradigm
to guide us: Sequence implies
structure, and structure implies function.
The word 'implies' in our new paradigm
means there are rules," continued Smith,
"but these are rules we don't understand
today. With the aid of structural information,
algorithms, and computers, we will be
able to relate sequence to structure and
eventually relate structure to function. Our
effort focuses on developing the technologies
and tools that will allow us to do this
efficiently."

"That's how I think about what we do at
DOE," he said. "We're working a lot on
technology and projects aimed at human
and microbial genome sequencing. For understanding
sequence implications, we are
making major, increasing investments in
synchrotrons, synchrotron user facilities,
neutron user facilities, and big nuclear
magnetic resonance machines. These are all
aimed at rapid structure determination."
Smith explained that now we are seeing the
beginnings of the biotechnology revolution
implied by the sequence-to-structure-to-function
paradigm. "If you really understand
the relationship between sequence
and function, you can begin to design sequences
for particular purposes. We don't
yet know that much about the world around
us, but there are capabilities out there in the
biological world, and if we can understand
them, we can put those capabilities to use."

"Comparative genomics," he continued,
"will teach us a tremendous amount about
human evolution. The current phylogenetic
tree is based on ribosomal RNA sequences,
but when we have determined whole genomic
sequences of different microbes,
they will probably give us different ideas
about relationships among archaebacteria,
eukaryotes, and prokaryotes."

Feeling good about progress over the previous
5 years, Smith summed it up succinctly:
"Genomics has come of age, and it
is opening the door to entirely new approaches
to biology."


David Smith retired at the end of January
1996. Taking responsibility for the DOE
Human Genome Program is Aristides
Patrinos, who is also Associate Director
of the DOE Office of Biological and Environmental
Research. Marvin Frazier is
Director of the Health Effects and Life
Sciences Research Division, which manages
the Human Genome Program.
Looking to the Future


Insights, technologies, and resources already emerging
from the genome project, together with advances
in such fields as computational and structural biology,
will provide biologists and other researchers with important
tools for the 21st century.
Highlights of Research Progress


The early years of the Human
Genome Program
have been remarkably successful.
Critical resources
and infrastructures have
been established, and technologies have
been developed for producing several
useful types of chromosomal maps.
These gains are supporting the project's
transition to the large-scale sequencing
phase. Some highlights and trends in the
U.S. Department of Energy's (DOE)
Human Genome Program after FY 1993
are presented in this section.

Clone Resources for
Mapping, Sequencing,
and Gene Hunting

The demands of large chromosomal
mapping and sequencing efforts have
necessitated the development of several
different types of clone collections
(called libraries) carrying human DNA.
Three generations of DOE-developed libraries
are being distributed to research
teams in the United States and abroad.
In these libraries, human DNA segments
of various lengths are maintained
in bacterial cells.


genome researchers worldwide
(http://www-bio.llnl.gov/genome/html/cosmid.html).
Very high resolution chromosome
maps based principally on
NLGLP libraries were published in
1995 for chromosomes 16 and 19.
These are described in detail in the Research
Narratives section of this report
(see LLNL, p. 27, and LANL, p. 35).

PACs  and  BACs 

The third generation of clone resources
supporting chromosome mapping is
composed of P1 artificial chromosome
(PAC) and bacterial artificial chromosome
(BAC) libraries. A prototype PAC
library was produced by the team of
Leon Rosner (then at DuPont) many
years ago, but more efficient production
began with improvements introduced
by the DOE-supported teams
headed by Melvin Simon at Caltech
(BACs) and Pieter de Jong at Roswell
Park (PACs).

In contrast to cosmids, BACs and PACs
provide a more uniform representation
of the human genome, and the greater
length of their inserts (90,000 to


Transitioning  to 
large-scale  sequencing 


DOE  Genome  Research 
Web  Site 

http://www.ornl.gov/hgmis/research.html


NLGLP  Libraries 

The first two generations are
chromosome-specific libraries carrying
small inserts of human DNA (15,000 to
40,000 base pairs). As part of the National
Laboratory Gene Library Project
(NLGLP) begun in 1983, these libraries
were prepared at Los Alamos National
Laboratory (LANL) and Lawrence
Livermore National Laboratory (LLNL)
using DOE flow-sorting technology to
separate individual chromosomes. Library
availability has allowed the very
difficult whole-genome tasks to be divided
into 24 more manageable single-chromosome
projects that could be
pursued at separate research centers.
Completed in 1994, NLGLP libraries
have provided critical resources to
have  provided  critical  resources  to 


Research  Narratives 


Separate narratives, beginning on p. 25, contain detailed
descriptions of research programs and accomplishments at
these major DOE genome research facilities.

• Lawrence Livermore National Laboratory

• Los Alamos National Laboratory

• Lawrence Berkeley National Laboratory

• University of Washington Genome Sequencing
Laboratory

• Genome Database

• National Center for Genome Resources


Research Abstracts


Descriptions of individual research projects at other institutions
are given in Part 2, 1996 Research Abstracts.


DOE  Human  Genome  Program  Report 




300,000 base pairs) facilitates both
mapping and sequencing. Their usefulness
was illustrated dramatically in
1993 when the first breast cancer-susceptibility
gene (BRCA1) was found
in a BAC clone after other types of resources
had failed. The next year, with
major support from NIH, de Jong's PACs
contributed to the isolation of the second
human breast cancer-susceptibility gene
(BRCA2).

Mapping

The assembly of ordered, overlapping
sets (contigs) of high-quality clones has
long been considered an essential step
toward human genome sequencing.
Because the clones have been mapped
to precise genomic locations, DNA
sequences obtained from them can be
located on the chromosomes with minimal
uncertainty.


The large insert size of BACs and
PACs allows researchers to visually
map them on chromosomes by using
fluorescence in situ hybridization
(FISH) technology (see photomicrograph
below). These mapped BACs and
PACs represent very valuable resources
for the cytogeneticist exploring chromosomal
abnormalities. Two major medical
genetics resources have been
developed: (1) The Resource for Molecular
Cytogenetics at the University of
California, San Francisco, in collaboration
with the Lawrence Berkeley National
Laboratory (LBNL) team led by
Joe Gray (http://rmc-www.lbl.gov) and
(2) The Total Human Genome BAC-PAC
Resource at Cedars-Sinai Medical
Center, Los Angeles, developed by Julie
Korenberg's laboratory (see map, p. 12,
and Web site,
http://www.csmc.edu/genetics/korenberg/korenberg.html).
Coordinated Mapping
and Sequencing

A simple strategy was proposed in 1996
for choosing BACs or PACs to elongate
sequenced regions most efficiently
[Nature 381, 364-66 (1996)]. The first
step is to develop a BAC end sequence
database, with each entry having the
BAC clone name and the sequences of
its human insert ends. In toto, the source
BACs should represent a 15- to 20-fold
coverage of the human genome. Then
for any BAC or chromosomal region sequenced,
a comparison against the database
will return a list of BACs (or
PACs) that overlap it. Optimal choices
for the next BACs (or PACs) to be sequenced
can then be made, entailing
minimal overlap (and therefore minimal
redundancy of sequencing).
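
The selection step in this strategy can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not the procedure used by any of the laboratories named in this report: the names BacEnds, overlap_with_region, and next_clone are hypothetical, exact substring matching stands in for a real sequence-similarity search, and clone orientation (reverse complements) is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BacEnds:
    """One database entry: clone name plus the sequences of its two insert ends."""
    name: str
    left_end: str
    right_end: str


def overlap_with_region(region: str, clone: BacEnds) -> Optional[int]:
    """Return how many bases of `region` the clone would re-cover, or None.

    Assumption: a clone extends the region to the right when its left end
    matches inside the region and its right end does not. Exact matching
    is a stand-in for a real similarity search.
    """
    start = region.find(clone.left_end)
    if start == -1 or clone.right_end in region:
        return None  # no overlap found, or the clone lies wholly inside the region
    return len(region) - start


def next_clone(region: str, database: list[BacEnds]) -> Optional[BacEnds]:
    """Pick the overlapping clone with the smallest overlap (least redundant sequencing)."""
    candidates = [
        (overlap, clone)
        for clone in database
        if (overlap := overlap_with_region(region, clone)) is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]
```

Given a sequenced region and a list of BacEnds entries, next_clone returns the entry whose left end matches closest to the right edge of the region, so the fewest already-sequenced bases are repeated. A real end-sequence database would also need to handle both strands, sequencing errors, and repeated sequences.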

Two pilot BAC-PAC end-sequencing
projects were initiated in September of
1996 to explore feasibility, optimize
technologies, establish quality controls,
and design the necessary informatics
infrastructure. Particular benefits are anticipated
for small laboratories that will
not have to maintain large libraries of
clones and can avoid preliminary contig
mapping (see abstracts of Glen Evans;
Julie Korenberg; Mark Adams, Leroy
Hood, and Melvin Simon; and Pieter de
Jong in Part 2 of this report).

Updated information on BAC-PAC resources
can be found on the Web
(http://www.ornl.gov/meetings/bacpac/95bac.html).
[See Appendix C: Human Subjects
Guidelines, p. 77, or
http://www.ornl.gov/hgmis/archive/nchgrdoe.html for
DOE-NIH guidelines on using DNA
from human subjects for large-scale
sequencing.]

cDNA  Libraries 

In 1990, DOE initiated projects to enrich
the developing chromosome contig
maps with markers for genes. Although
the protein-encoding messenger RNAs
are good representatives of their source
genes, they are unstable and must be
converted to complementary DNAs
(cDNAs) for practical applications.
These conversions are tricky, and artifacts
are introduced easily. The team led
by Bento Soares (University of Iowa)
has optimized the steps and continues to
produce cDNA libraries of the highest
quality. At LLNL, individual cDNA
clones are put into standard arrays and
then distributed worldwide for characterization
by the international IMAGE
(for Integrated Molecular Analysis of
Gene Expression) Consortium (see box,
p. 13).

Initially supported under a DOE cDNA
initiative, Craig Venter's team (now at
The Institute for Genomic Research)
greatly improved technologies for reading
sequences from cDNA ends (expressed
sequence tags, called ESTs).
Together with complementary analysis
software, ESTs were shown to be a valuable
resource for categorizing cDNAs
and providing the first clues to the functions
of the genes from which they are
derived. This fast EST approach has attracted
millions of dollars in commercial
investment. Mapping the cDNA onto a
chromosome can identify the location of
its corresponding gene. Many laboratories
worldwide are contributing to the
continuing task of mapping the estimated
70,000 to 100,000 human genes.

HAECs 

All the previously described DNA
clones are maintained in bacterial host
cells. However, for unknown reasons,
some regions of the human genome appear
to be unclonable or unstable in
bacteria. The team led by Jean-Michel
Vos (University of North Carolina,
Chapel Hill) has developed a human artificial
episomal chromosome (HAEC)
system based on the Epstein-Barr virus
that may be useful for coverage of these
especially difficult regions. In the broader
biomedical community, HAECs also
show promise for use in gene therapy.
BAC-PAC Map. The Total Human Genome BAC-PAC
Resource represents an important tool for understanding
the genes responsible for human development and disease
(http://www.csmc.edu/genetics/korenberg/korenberg.html).
The Resource, consisting of more than 5000 BAC and PAC
clones, covers every human chromosome band and 25%
of the entire human genome. Each color dot represents a
single BAC or PAC clone mapped by FISH to a specific
chromosome band represented in black and white. The
clones, which are stable and useful for sequencing, have
been integrated with the genetic and physical chromosome
maps. (Source: Julie Korenberg, Cedars-Sinai Medical Center)

[Figure legend: color key for BAC and PAC mapping categories; legend text not legible in the source.]
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Resources  for  Gene 
Discovery 

Hunting  for  disease  genes  is  not  a  spe- 
cific goal  of  the  DOE  Human  Genome 
Program.  However,  DOE-supported 
libraries  sent  to  researchers  worldwide 
have  facilitated  gene  hunts  by  many  re- 
search teams.  DOE  libraries  have  played 
a  role  in  the  discovery  of  genes  for  cystic 
fibrosis,  the  most  common  lethal  inher- 
ited disease  in  Caucasians;  Huntington's 
disease,  a  progressive  lethal  neurological 
disorder;  Batten's  disease,  the  most 
prevalent  neurodegenerative  childhood 
disease;  two  forms  of  dwarfism;  Fanconi 
anemia,  a  rare  disease  characterized  by 
skeletal  abnormalities  and  a  predisposi- 
tion to  cancer;  myotonic  dystrophy,  the 
most  common  adult  form  of  muscular 
dystrophy;  a  rare  inherited  form  of  breast 
cancer;  and  polycystic  kidney  disease, 
which  affects  an  estimated  500,000 
people  in  the  United  States  at  a  healthcare 
cost  of  over  $1  billion  per  year. 

The  team  led  by  Fa-Ten  Kao  (Eleanor 
Roosevelt  Institute)  has  microdissected 


several  chromosomes  and  made  deriva- 
tive clone  libraries  broadly  available  to 
disease-gene  hunters.  This  resource 
played  a  critical  role  in  isolating  the 
gene  responsible  for  some  15%  of  colon 
cancers. 

Of  Mice  and  Humans: 
The Value of
Comparative  Analyses 

A  remaining  challenge  is  to  recognize 
and discriminate all the functional constituents
of a gene, particularly regulatory
components not represented within
cDNAs,  and  to  predict  what  each  gene 
may  actually  do  in  human  biology. 
Comparing  human  and  mouse  se- 
quences is  an  exceptionally  powerful 
way  to  identify  homologous  genes  and 
regulatory  elements  that  have  been  sub- 
stantially conserved  during  evolution. 

Researchers  led  by  Leroy  Hood  (Uni- 
versity of  Washington,  Seattle)  have 
analyzed more than 1 million bases of
sequence  from  T-cell  receptor  (TCR) 


To  IMAGE  the  Human 
Gene  Map 

Since  1993,  the  Integrated  Molecular 
Analysis  of  Gene  Expression  (IM- 
AGE) Consortium  has  played  a  major 
role in the development of a human
gene map. Founding members of the
IMAGE Consortium are Bento Soares
(Columbia University, now at University
of Iowa), Gregory Lennon
(LLNL), Mihael Polymeropoulos
(National Institutes of Health's National
Institute of Mental Health),
and Charles Auffray (Généthon, in
France). Because cDNA molecules
represent coding (expressed-gene)
areas of the genome, sets of cloned
cDNAs are a valuable resource to
the gene-mapping community. The


cDNA  libraries  representing  different 
tissues  have  many  members  in  com- 
mon. Thus,  good  coordination  among 
participating  laboratories  can  minimize 
redundant work. The international IMAGE
Consortium laboratories fulfill
this role by developing and arraying
cDNA clones for worldwide use.
[http://www-bio.llnl.gov/bbrp/image/image.html]

From  the  IMAGE  cDNA  clones,  re- 
searchers at  the  Washington  University 
(St. Louis) Sequencing Center determine
ESTs with support from Merck,
Inc. The data, which are used in gene
localization, are then entered into public
databases. More than 10,000 chromosomal
assignments have been entered
into Genome Database (http://www.gdb.org).
Including replica copies, over


3 million clones have been distributed,
probably representing about
50,000 distinct human genes.

The  IMAGE  infrastructure  is  being 
used in two additional programs. At
LLNL, the IMAGE laboratory arrays
mouse cDNA libraries produced by
Soares for the Washington University
Mouse EST project
(http://genome.wustl.edu/est/mouse_esthmpg.html)
with  sequencing  sponsored  by  the 
Howard  Hughes  Medical  Institute. 
Additional  clone  libraries  are  being 
used  in  a  collaborative  sequencing 
project  sponsored  by  the  NIH  Na- 
tional Cancer  Institute  as  part  of  the 
Cancer Genome Anatomy Project to
identify and fully sequence genes
implicated in major cancers
(http://www.ncbi.nlm.nih.gov/ncicgap).

chromosome regions of both human and
mouse  genomes.  Many  subtle  functional 
elements  can  be  recognized  only  by 
comparing  human  and  mouse  sequences. 
TCRs  play  a  major  role  in  immunity 
and  autoimmune  disease,  and  insights 
into  their  mechanisms  may  one  day  help 
treat  or  even  prevent  such  diseases  as 
arthritis,  diabetes,  and  multiple  sclerosis 
(possibly  even  AIDS). 

Comparative  analysis  is  also  used  to 
model  human  genetic  diseases.  Given 
sequence  information,  researchers  can 
produce  targeted  mutations  in  the  mouse 
as  a  rapid  and  economical  route  to  elu- 
cidating gene  function.  Such  studies 
continue  to  be  used  effectively  at  Oak 
Ridge  National  Laboratory  (ORNL). 

DNA  Sequencing 

From  the  beginning  of  the  genome 
project,  DOE's  DNA  sequencing- 
technology  program  has  supported  both 
improvements  to  established  method- 
ologies and  innovative  higher-risk  strat- 
egies. The  first  major  sequencing 
project,  a  test  bed  for  incremental  im- 
provements, culminated  with  elucida- 
tion of  the  highly  complex  TCR  region 
(described  above)  by  a  team  led  by 
Hood. 

A  novel  "directed"  sequencing  strategy 
initiated  at  LBNL  in  1993  provides  a 
potential  alternative  approach  that  can 
include  automation  as  a  core  design  fea- 
ture. In  this  approach,  every  sequencing 
template  is  first  mapped  to  its  original 
position  on  a  chromosome  (resolution, 
30  bases).  The  advantages  of  this  method 
include  a  large  reduction  in  the  number 
of  sequencing  reactions  needed  and  in 
the  sequence-assembly  steps  that  follow. 
To  date,  this  directed  strategy  has 
achieved  significant  results  with  simpler, 
less repetitive nonhuman sequences, particularly
in the NIH-funded Drosophila
genome  program.  The  system  also  is  in 
use  at  the  Stanford  Human  Genome 
Center  and  Mercator  Genetics,  Inc. 


The  preparation  of  DNA  clones  for  se- 
quencing involves  several  biochemical 
processing  steps  that  require  different 
solution  environments.  At  the  White- 
head Institute,  Trevor  Hawkins  has  im- 
proved systems  for  reversible  binding  of 
DNA  molecules  to  magnetic  beads  that 
are  compatible  with  complete  robotic 
management.  The  second-generation 
Sequatron  fits  on  a  tabletop  with  a 
single  robotic  arm  moving  sample  trays 
between  servicing  stations.  This  very 
compact  system,  supported  by  sophisti- 
cated software,  may  be  ideal  for  labora- 
tories with  limited  or  costly  floor  space. 

Fluorescent  tags  are  critical  components 
of  conventional  automated  sequencing 
approaches.  The  team  of  Richard 
Mathies  and  Alexander  Glazer  (Univer- 
sity of  California,  Berkeley)  has  made  a 
series  of  improvements  in  fluorescence 
systems  that  have  decreased  DNA  input 
needs  and  markedly  increased  the  qual- 
ity of  raw  data,  thereby  supporting 
longer  useful  reads  of  DNA  sequence. 

Complementary  improvements  in  enzy- 
mology  have  been  achieved  by  the  team 
of  Charles  Richardson  and  Stanley  Ta- 
bor (Harvard  Medical  School).  Current 
widely  used  procedures  for  automated 
DNA  sequencing  involve  cycling  be- 
tween high  and  low  temperatures.  The 
Harvard  researchers  used  information 
about  the  three-dimensional  structure  of 
polymerases  (enzymes  needed  for  DNA 
replication)  and  how  they  function  to 
engineer  an  improved  Taq  polymerase. 
ThermoSequenase,  which  is  now  pro- 
duced commercially  as  part  of  the 
ThermoSequenase  kit,  reduces  the 
amount  of  expensive  sequencing  re- 
agents required  and  supports  popular 
cycle-sequencing protocols.

The  application  of  higher  electrical 
fields  in  gel  electrophoresis  separation 
of  DNA  fragments  can  increase  se- 
quencing speed  and  efficiency.  Conven- 
tional thick  gels  cannot  adequately 
dissipate  the  additional  heat  produced, 
however. Two promising routes to
"thinness" are ultrathin slab gels and
capillary systems. An ultrathin gel system
was developed by Lloyd Smith
(University  of  Wisconsin,  Madison)  and 
licensed  for  commercial  development. 

The  replacement  of  gels  by  pumpable 
solutions  of  long  polymers  is  making 
capillary  array  electrophoresis  (CAE) 
potentially  practical  for  DNA  sequenc- 
ing. The  first  CAE  system  for  DNA  was 
demonstrated  by  the  team  of  Barry 
Karger (Northeastern University). In
1995, Karger and Norman Dovichi (University
of Alberta, Canada) separately
identified  CAE  conditions  under  which 
DNA  sequencing  reads  could  be  ex- 
tended usefully  up  to  the  1000-base 
range.  Another  CAE  system,  developed 
by Edward Yeung (Iowa State University),
has been licensed for commercial
production  (see  box,  p.  23).  Mathies  has 
developed  a  system  in  which  a  confocal 
microscope  displays  DNA  bands.  Appli- 
cation of  this  system  to  the  sizing  of 
larger  DNA  fragments  binding  multiple 
fluors  allows  single-molecule  detection. 

Replacing  the  gel-separation  step  with 
mass  spectroscopy  (MS)  is  another 
promising  approach  for  rapid  DNA  se- 
quencing. MS  uses  differences  in  mass- 
to-charge  ratios  to  separate  ionized 
atoms  or  molecules.  Early  efforts  at  MS 
sequencing  were  plagued  by  chemical 
reactivity  during  the  "launching"  phase 
of matrix-assisted laser desorption ionization
(MALDI). MALDI badly degraded
the DNA sample input. However,
the  degradation  chemistry  was  elucidated 
in  Smith's  laboratory,  leading  to  improve- 
ments. At  ORNL,  the  team  of  Chung- 
Hsuan  Chen  has  performed  extensive 
trials  of  alternative  matrices  and  has 
achieved  significant  improvements  that 
now  support  sequence  reads  up  to  100 
DNA  bases.  The  system  is  undergoing 
trials  for  DNA  diagnostic  applications. 

The  most  revolutionary  sequencing  tech- 
nology is  being  pursued  by  the  team  of 
Richard  Keller  and  James  Jett  at  LANL. 
Their goal is to read out sequence from
single  DNA  molecules,  work  that  builds 


on  LANL's  expertise  in  flow  cytometry. 
The  strand  to  be  sequenced  is  labeled 
first  with  fluors  that  distinguish  the 
four  DNA  subunits  and  is  then  sus- 
pended in  a  flow  stream.  An  exonu- 
clease  cleaves  the  subunits,  which  flow 
past  an  interrogating  laser  system  that 
reports  the  subunits'  identities.  All  sys- 
tem constituents  are  operational  but 
limited  by  the  low  subunit  release  rates 
of  commercially  available  exonu- 
cleases.  A  current  developmental  focus 
is  on  identifying  more  active  exonu- 
cleases. 

Synthetic  DNA  strands  in  the  15-  to  30- 
base  range  (oligomers)  play  essential 
roles  in  DNA  sequencing;  in  sample- 
preparation  steps  for  the  polymerase 
chain  reaction,  which  copies  DNA 
strands  millions  of  times;  and  in  DNA- 
based  diagnostics.  The  cost  of  custom 
oligomer  synthesis  once  was  a  limiting 
factor  in  many  research  projects.  A 
more  economical,  highly  parallel  oligo- 
mer synthesis  technology  was  devel- 
oped by  Thomas  Brennan  at  Stanford 
University  (see  last  bullet,  p.  22,  for 
further  details). 

The  sequencing  by  hybridization 
(SBH)  technology  provides  information 
only  on  short  stretches  of  DNA  in  a 
single  trial  (interrogation),  but  thou- 
sands of  low-cost  interrogations  can  be 
performed  in  parallel.  SBH  is  very  use- 
ful for  rapid  classification  of  short 
DNAs  such  as  cDNAs,  very  low  cost 
DNA  resequencing,  and  detection  of 
DNA  sequence  differences  (polymor- 
phisms) over  short  regions.  The  team  of 
Radomir Crkvenjakov and Radoje
Drmanac  invented  one  format  of  SBH 
while  in  Yugoslavia,  made  substantial 
improvements  at  Argonne  National 
Laboratory  (ANL),  and  later  started 
Hyseq  Inc.  to  commercialize  these 
technologies. At ANL, another implementation,
SBH on matrices (SHOM)
of gels, holds promise for high-accuracy
sequence proofreading and diverse
DNA diagnostics. The ANL team, led
by Andrei Mirzabekov, collaborates
with the Engelhardt Institute in Moscow,
where SHOM was demonstrated initially.
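
The SBH idea described above, in which many short probe results are combined into one longer sequence, can be illustrated with a minimal sketch. It assumes an idealized, error-free experiment that reports every k-base stretch present in the target, and it rebuilds the sequence by walking an Eulerian path through the overlaps between probes. The function names (spectrum, reconstruct) and the example sequence are hypothetical and are not taken from the Hyseq or ANL systems.

```python
from collections import defaultdict


def spectrum(seq, k):
    """Return every k-base stretch of seq, as an idealized SBH readout."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reconstruct(kmers):
    """Rebuild a sequence from its k-mers.

    Each k-mer is treated as an edge from its first k-1 bases to its
    last k-1 bases; an Eulerian path through those edges spells the
    sequence (Hierholzer's algorithm).
    """
    if not kmers:
        return ""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        outdeg[prefix] += 1
        indeg[suffix] += 1

    # The path starts at the node with one more outgoing than incoming edge.
    nodes = set(indeg) | set(outdeg)
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), None)
    if start is None:
        start = kmers[0][:-1]

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    target = "ATGGCGTGCA"          # made-up example sequence
    probes = spectrum(target, 4)   # what an ideal 4-base probe array would report
    print(reconstruct(sorted(probes)))
```

In this toy case every 3-base overlap is unique, so the reconstruction is unambiguous. When a stretch longer than k-1 bases repeats within the target, more than one path fits the same probe set, which is consistent with the text's point that SBH suits short DNAs and resequencing.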

Informatics:  Data 
Collection  and  Analysis 

Explosive  growth  of  information  and  the 
challenges  of  acquiring,  representing, 
and  providing  access  to  data  pose  continu- 
ing monumental  tasks  for  the  large  public 
databases.  Over  the  last  3  years,  the  Ge- 
nome Database  (GDB),  the  major  inter- 
national repository  of  human  genome 
mapping  data,  has  made  extensive  changes 
culminating  in  the  enhanced  representa- 
tion of  genomic  maps  and  gene  informa- 
tion in  GDB  V6.0.  Major  issues  for  the 
Genome  Sequence  DataBase  (GSDB), 
established  in  1994,  are  to  capture  and 
annotate  the  sequence  data  and  to  repre- 
sent it  in  a  form  capable  of  supporting 
complex,  ad  hoc  queries.  Both  GDB  and 
GSDB have been restructured recently to
handle the increasing flood of data and
make it more useful for downstream
biology (see Research Narratives, GDB,
p. 49, and GSDB, p. 55). [http://www.gdb.org
and http://www.ncgr.org/gsdb]

Victor  Markowitz,  formerly  of  LBNL,  has 
developed  a  suite  of  database  tools  allow- 
ing substantial  modifications  of  underly- 
ing data  structures  while  the  biologists' 
query tools remain stable.
[http://gizmo.lbl.gov/DM_TOOLS/DMTools.html]

The  Genome  Annotation  Consortium 
(based  at  ORNL)  was  initiated  in  1997  to 
be  a  modular,  distributed  informatics  fa- 
cility for  analyzing  and  processing  (e.g., 
annotating)  genome-scale  sequence  data. 

The  many  improvements  in  World  Wide 
Web  software  now  enable  maps  to  be 
downloaded  simply  by  using  a  browser 
with  accessory  software  provided  by 
GDB.  Computers  sift  stretches  of  DNA 
sequence  for  patterns  that  identify  such 
biologically  important  features  as  pro- 
tein-coding regions  (exons),  regulatory 
areas,  and  RNA  splice  sites.  Other  com- 
puter tools  are  used  to  compare  a  new  se- 


quence (i.e.,  a  putative  gene)  against  all 
other  database  entries,  retrieve  any  ho- 
mologous sequences  that  already  have 
been  entered,  and  indicate  the  degree  of 
similarity. 
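
The database comparison step described above (score a new sequence against every stored entry and report the degree of similarity) can be sketched as follows. This is an illustrative Smith-Waterman-style local alignment score with made-up scoring values and a made-up two-entry database; it is not the algorithm or the parameters used by any specific tool named in this report.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Dynamic programming over a score table, keeping only the previous
    row. The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score the query against every entry and return hits, best first."""
    hits = [(name, local_alignment_score(query, seq))
            for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical entries; real databases hold far longer sequences.
    database = {"geneA": "ACGTACGTGA", "geneB": "TTTTTTTTTT"}
    for name, score in search("ACGTACGT", database):
        print(name, score)
```

Production search programs rely on heuristics to avoid filling a full table for every database entry; the exhaustive version here only shows what "degree of similarity" means as a number.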

The  Gene  Recognition  and  Analysis 
Internet  Link  (GRAIL)  at  ORNL  local- 
izes genes  and  other  biologically  impor- 
tant sequence  features  (see  box,  p.  17). 

Another  analytical  service  that  returns 
informative,  annotated  data  is  MAG- 
PIE, provided  through  ANL  by  Terry 
Gaasterland.  MAGPIE  is  designed  to 
reside  locally  at  the  site  of  a  genome 
project  and  actively  carry  out  analysis 
of genome sequence data as it is generated,
with automated continued reevaluation
as search databases grow
(http://www.mcs.anl.gov/home/gaasterl/magpie.html).
Once an automated functional
overview has been established, it
remains to pinpoint the organisms' exact
metabolic pathways and establish
how they interact. To this end, the WIT
(What Is There) system, which succeeds
PUMA, supports the construction of
metabolic  pathways.  Such  constructions 
or  models  are  based  on  sequence  data, 
the  clearly  established  biochemistry  of 
specific  organisms,  and  an  understand- 
ing of  the  interdependencies  of  bio- 
chemical mechanisms.  WIT,  which  was 
developed  by  Evgenij  Selkov  and  Ross 
Overbeek  at  ANL,  offers  a  particularly 
valuable tool for testing current hypotheses
about microbial biology.
[http://www.cme.msu.edu/WIT]

Researchers  at  the  University  of  Colo- 
rado have  developed  another  approach 
for  predicting  coding  regions  in  ge- 
nomic DNA,  combining  multiple  types 
of  evidence  into  a  single  scoring  func- 
tion, and  returning  both  optimal  and 
ranked  suboptimal  solutions.  The  ap- 
proach is  robust  to  substitution  errors 
but  sensitive  to  frameshift  errors.  The 
group is now exploring methods for
predicting other classes of sequence regions,
especially promoters. [Software

GRAIL and GenQuest

In 1996 the Gene Recognition and
Analysis Internet Link (GRAIL)
processed nearly 40 million bases
of sequence per month, making it
the most widely used "gene-finding"
system available. Developed
at Oak Ridge National Laboratory
(ORNL) by a team led by
Ed Uberbacher, GRAIL uses artificial
intelligence and machine
learning to discover complex relationships
in sequence data. The
genQuest server, also at ORNL,
compares information generated
by GRAIL with data in protein,
DNA, and motif databases to add
further value to annotation of
DNA sequences.

GRAIL's latest version (1.3) combines
a Motif Graphical Client
with improved sensitivity and
splice-site recognition, better performance
in AT-rich regions, new
analysis systems for model organisms,
and frameshift detection.
This system can be used on a wide
variety of UNIX platforms, including
Sun, DEC, and SGI. The many
ways to access GRAIL include a
command line sockets client that
permits remote program calls to all
basic GRAIL-genQuest analysis
services, thus allowing convenient
integration of GRAIL results into
automated analysis pipelines.

Contact GRAIL staff through the
Web site at http://compbio.ornl.gov
or at GRAILMAIL@ornl.gov
for e-mail and ftp access.


and information:
http://beagle.colorado.edu/~eesnyder/GeneParser.html]
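
The general idea behind the University of Colorado approach described above, folding several kinds of evidence into one score and reporting the best candidate together with ranked runners-up, can be sketched very simply. The evidence names, weights, and candidate coordinates below are hypothetical and are not taken from the GeneParser software, which is a far more elaborate program.

```python
from dataclasses import dataclass, field

# Hypothetical evidence types and weights, chosen only for illustration.
WEIGHTS = {"codon_usage": 0.5, "splice_site": 0.3, "database_match": 0.2}


@dataclass(frozen=True)
class Candidate:
    """A candidate coding region with per-evidence scores in [0, 1]."""
    name: str
    start: int
    end: int
    evidence: dict = field(default_factory=dict, hash=False, compare=False)


def combined_score(candidate, weights=WEIGHTS):
    """Weighted sum of all evidence types; missing evidence counts as 0."""
    return sum(w * candidate.evidence.get(key, 0.0)
               for key, w in weights.items())


def rank(candidates, top=3):
    """Return the optimal candidate first, then ranked suboptimal ones."""
    return sorted(candidates, key=combined_score, reverse=True)[:top]


if __name__ == "__main__":
    candidates = [
        Candidate("c", 900, 1010, {"codon_usage": 0.2, "splice_site": 0.1,
                                   "database_match": 0.5}),
        Candidate("a", 120, 310, {"codon_usage": 0.9, "splice_site": 0.8,
                                  "database_match": 1.0}),
        Candidate("b", 450, 600, {"codon_usage": 0.6, "splice_site": 0.9}),
    ]
    for cand in rank(candidates):
        print(cand.name, round(combined_score(cand), 2))
```

Keeping the suboptimal candidates, and not only the top one, lets a user inspect close alternatives when the evidence is ambiguous.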

The  Baylor  College  of  Medicine  (BCM) 
Search  Launcher  improves  user  access 
to  the  wide  variety  of  database-search 
tools  available  on  the  Web.  Search 
Launcher features a single point of entry
for related searches, the addition of
hypertext links to results returned by remote
servers, and a batch client.
[http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html]

FASTA-SWAP,  also  from  the  BCM 
group,  is  a  new  pattern-search  tool  for 
databases  that  improves  sensitivity  and 
specificity to help detect related sequences.
BEAUTY, an enhanced version
of the BLAST database-search
program, improves access to information
about the functions of matched
sequences and incorporates additional
hypertext links. Graphical displays allow
correlation of hit positions with annotated
domain positions. Future plans
include providing access to information
from and direct links to other databases,
including organism-specific databases.

PROCRUSTES  uses  comparisons  of 
the  same  gene  of  different  species  to 
delimit  gene  structure  much  more  accu- 
rately. The  product  of  a  collaboration 
between  Pavel  Pevzner  (University  of 
Southern  California)  and  two  Russian 
researchers,  PROCRUSTES  is  based  on 
the  spliced-alignment  algorithm,  which 
explores  all  possible  exon  assemblies 
and  finds  the  multiexon  structure  that 
best fits a related protein.
[http://www-hto.usc.edu/software/procrustes/]

The ELSI component of the DOE
Human Genome Program
supports projects to help judges
understand the scientific
validity of the genetics-based
claims that are poised to flood
the nation's courtrooms. Robert
E. Orr (left) of the North
Carolina Supreme Court and
Francis X. Spina of the Massachusetts
Appeals Court at the
New England Regional
Conference on the Courts and
Genetics (July 1997) participate
in a hands-on laboratory
session. As a prelude to learning
the fundamentals of DNA
science and genetic testing, the
judges are precipitating DNA
(seen as streaks on the glass rod
in the tube) from a solution
containing the bacterium
Escherichia coli. (Courts and
Science On-Line Magazine:
http://www.ornl.gov/courts/)


Ethical,  Legal,  and 
Social  Issues  (ELSI) 

From  the  outset  of  the  Human  Genome 
Project,  researchers  recognized  that  the 
resulting  increase  in  knowledge  about 
human  biology  and  personal  genetic  in- 
formation would  raise  complex  ethical 
and  policy  issues  for  individuals  and 
society.  Rapid  worldwide  progress  in 
the project has heightened the urgency
of  this  challenge. 

Most observers agree that personal
knowledge of genetic susceptibility can
be expected to serve humankind well,
opening  the  door  to  more  accurate  diag- 
noses, preventive  intervention,  intensi- 
fied screening,  lifestyle  changes,  and 
early  and  effective  treatment.  But  such 
knowledge  has  another  side,  too:  risk  of 
anxiety,  unwelcome  changes  in  personal 
relationships,  and  the  danger  of  stigma- 
tization.  Often,  genetic  tests  can  indi- 
cate possible  future  medical  conditions 
far  in  advance  of  any  symptoms  or 
available  therapies  or  treatments.  If 
handled  carelessly,  genetic  information 
could  threaten  an  individual  with  dis- 
crimination by  potential  employers  and 


Other issues are perhaps less immediate
than these personal concerns but no less
challenging. How, for example,
are  products  of  the  Human  Ge- 
nome Project  to  be  patented  and 
commercialized?  How  are  the  ju- 
dicial, medical,  and  educational 
communities — not  to  mention  the 
public  at  large — to  be  educated 
effectively  about  genetic  research 
and  its  implications? 

To  confront  these  issues,  the  DOE 
and  NIH  ELSI  programs  jointly 
established  an  ELSI  working 
group  to  coordinate  policy  and 

research  between  the  two  agencies. 
[An FY 1997 report evaluating
the joint ELSI group is available
on the Web
(http://www.ornl.gov/hgmis/archive/elsirept.html).]

The  DOE  Human  Genome  Program  has 
focused its ELSI efforts on education,
privacy,  and  the  fair  use  of  genetic  in- 
formation (including  ownership  and 
commercialization);  workplace  issues, 
especially  screening  for  susceptibilities 
to environmental agents; and implications
of research findings regarding interactions
among multiple genes and
environmental  influences. 

A  few  highlights  from  the  DOE  ELSI 
portfolio  for  FY  1994  through  FY  1997 
are  outlined  below. 

•  Three  high  school  curriculum  mod- 
ules developed  by  the  Biological 
Sciences  Curriculum  Study  (BSCS). 
[http://www.bscs.org]

• An educational program in Los Angeles
to develop a culturally and linguistically
appropriate genetics curriculum
based on a BSCS module (see above)
for Hispanic students and their families.
[http://vflylab.calstatela.edu/hgp]

•  A  series  of  workshops  to  educate  a 
core  group  of  1000  judges  around  the 
nation  and  a  handbook  with  compan- 
ion videotape  to  assist  federal  and 
state  judges  in  understanding  and  as- 
sessing genetic  evidence  in  an  in- 
creasing number  of  civil  and  criminal 
cases (see photo above).

• Educational materials developed by
the Science+Literacy for Health
Project of the American Association
for the Advancement of Science
(AAAS) and targeted at or above the
6th- to 8th-grade reading levels.
[AAAS: 202/326-6453; Your Genes,
Your Choices booklet:
http://www.nextwave.org/ehr/books/index.html]
•  A  program  at  the  University  of  Chi- 
cago aimed  at  developing  a  knowl- 
edge base  for  physicians  and  nurses 
who  will  train  other  practitioners  to 
introduce  new  genetic  services. 

•  A  series  of  radio  programs  (see  photo 
at  right)  on  the  science  and  ethical 
issues  of  the  genome  project  and  a 
TV  documentary  program  on  ELSI 
issues. [http://www.pbs.org]

•  The  Gene  Letter,  a  monthly  online 
newsletter  on  ELSI  issues  for 
healthcare professionals and consumers.
[http://www.geneletter.org]

•  A  congressional  fellowship  program 
in  human  genetics,  administered 
through AAAS, for one annual fellowship
for a mid-career geneticist.
[society@genetics.faseb.org]

•  The  draft  Genetic  Privacy  Act,  pre- 
pared as  a  model  for  privacy  legisla- 
tion and  covering  the  collection, 
analysis,  storage,  and  use  of  DNA 
samples  and  the  genetic  information 
derived from them.
[http://www.ornl.gov/hgmis/resource/privacy/privacy1.html]

•  Privacy  studies  at  the  Center  for  So- 
cial and  Legal  Research,  including  an 
analysis  of  the  effects  of  new  genetic 
technologies  on  individuals  and  insti- 
tutions. 

For  details  on  these  and  other  projects, 
see ELSI Abstracts, p. 45, in Part 2 of this
report. In addition to the specific projects
listed in Part 2, the DOE program sponsors
a number of conferences and workshops
on ELSI topics.


DOE  ELSI  Web  Site 
http://www.ornl.gov/hgmis/resource/elsi.html


Protection  of  Human  Research  Subjects 


In 1996, President Clinton appointed the National Bioethics Advisory Commission
to provide guidance on the ethical conduct of current and future biological
and behavioral research, especially that related to genetics and the
rights and welfare of human research subjects
(http://www.nih.gov/nbac/nbac.htm).

Also in 1996, DOE and NIH issued a document providing investigators with
guidance in the use of DNA from human subjects for large-scale sequencing
projects (see Appendix C: Human Subjects Guidelines, p. 17).
[http://www.ornl.gov/hgmis/archive/nchgrdoe.html]

Technology Transfer


Transferring technology to
the private sector, a primary
mission of DOE, is
strongly  encouraged  in  the 
Human  Genome  Program 
to  enhance  the  nation's  investment  in 
research  and  technological  competitive- 
ness. Human  genome  centers  at 
Lawrence  Berkeley  National  Laboratory 
(LBNL),  Lawrence  Livermore  National 
Laboratory  (LLNL),  and  Los  Alamos 
National  Laboratory  (LANL)  provide 
opportunities  for  private  companies  to 
collaborate  on  joint  projects  or  use  labo- 
ratory resources.  These  opportunities  in- 
clude access  to  information  (including 
databases),  personnel,  and  special  facili- 
ties; informal  research  collaborations; 
Cooperative  Research  and  Development 
Agreements  (CRADAs);  and  patent  and 
software  licensing.  For  information  on 
recently  developed  resources,  contact 
individual  genome  research  centers  or 
see  Research  Highlights,  beginning  on 
p.  9.  Many  universities  have  their  own 
licensing  and  technology  transfer  offices. 

Some  collaborations  and  technology- 
transfer  highlights  from  FY  1994 
through  FY  1996  are  described  below. 

Collaborations 

Involvement  of  the  private  sector  in  re- 
search and  development  can  facilitate 
successful  transfer  of  technology  to  the 
marketplace,  and  collaborations  can 
speed  production  of  essential  tools  for 
genome  research.  A  number  of  interac- 
tive projects  are  now  under  way,  and 
others  are  in  preliminary  stages. 

CRADAs 

One  technology-transfer  mechanism 
used  by  DOE  national  laboratories  is 
the  CRADA,  a  legal  agreement  with  a 
nongovernmental  organization  to  col- 
laborate on  a  defined  research  project. 
Under a CRADA, the two entities share
scientific and technological expertise,
with the governmental organization providing
personnel, services, facilities,


equipment,  or  other  resources.  Funds 
must  come  from  the  nongovernmental 
partner. A benefit to participating companies
is the opportunity to negotiate
exclusive licenses for inventions arising
from these collaborations. For periods
through 1996, the CRADAs in place in
the DOE Human Genome Program included
the following:

• LLNL with Applied Biosystems
Division of Perkin-Elmer Corporation
to develop analytical instrumentation
for faster DNA sequencing;

• LANL with Amgen, Inc., to develop
bioassays for cell growth factors;

• Oak Ridge National Laboratory
(ORNL) with Darwin Molecular,
Inc., for mouse models of human
immunologic disease;

• ORNL with Procter & Gamble, Inc.,
for analyses of liver regeneration
in a mouse model; and

• Brookhaven National Laboratory
with U.S. Biochemical Corporation
to identify proteins useful for
primer-walking methods and
large-scale sequencing.

Work  for  Others 

In  other  collaborations, 
the  LBNL  genome  center 
is  participating  in  a  Work 
for  Others  agreement 
with  Amgen  to  automate 
the  isolation  and  charac- 
terization of  large  num- 
bers of  mouse  cDNAs. 
The  center  group  is  focus- 
ing on  adapting  LBNL's 
automated  colony -picking 
system  to  cDNA  protocols 
and  applying  methods  to 
generate  large  numbers  of 
filter  replicas  for  colony 


Converting scientific
knowledge  into 
commercially  useful 
products 


Technology  Transfer 
Legislation 


Technology transfer involves converting scientific knowledge
into commercially useful products. Through the 1980s, a series
of laws was enacted to encourage the development of commercial
applications of federally funded research at universities and
federal laboratories. Such laws [chiefly the Bayh-Dole Act of
1980, Stevenson-Wydler Act of 1980, and Federal Technology
Transfer Act of 1986 (Public Laws 96-517, 96-480, and 99-502,
respectively)] were not aimed specifically at genome or even
biomedical research. However, such research and the
surrounding commercial biotechnology enterprises clearly have
benefited from them. The biotechnology sector's success owes
much to federal policies on technology transfer and
intellectual property. [Source: U.S. Congress, Office of
Technology Assessment, Federal Technology Transfer and the
Human Genome Project, OTA-BP-EHR-162 (Washington, DC: U.S.
Government Printing Office, September 1995)]
filter hybridization and subsequent analysis. ["Work for
Others" projects supported by an agency or organization other
than DOE (e.g., NIH, National Cancer Institute, or a private
company) can be conducted at a DOE installation because this
work is complementary to DOE research missions and usually
requires multidisciplinary DOE facilities and technologies.]

The Resource for Molecular Cytogenetics was established at
LBNL and the University of California (UC), San Francisco,
with the support of the Office of Biological and Environmental
Research and Vysis, Inc. (formerly Imagenetics). The Resource
aims to apply fluorescent in situ hybridization (FISH)
techniques to genetic analysis of human tissue samples;
produce probe reagents; design and develop digital-imaging
microscopy; distribute probes, analysis technology, and
educational materials in the molecular cytogenetic community;
and transfer useful reagents, processes, and instruments to
the private sector for commercialization.


NIST  Advanced  Technology  Program 


Several commercial applications of research sponsored by the
U.S. Human Genome Project have been furthered by the Advanced
Technology Program (ATP) of the U.S. National Institute of
Standards and Technology. ATP's mission is to stimulate
economic growth and industrial competitiveness by encouraging
high-risk but powerful new technologies. Its Tools for DNA
Diagnostics program uses collaborations among researchers and
industry to develop (1) cost-effective methods for
determining, analyzing, and storing DNA sequences for a wide
variety of diagnostic applications ranging from healthcare to
agriculture to the environment and (2) a new and potentially
very large market for DNA diagnostic systems.

Awardees have included companies developing DNA diagnostic
chips, more powerful cytogenetic diagnostic techniques based
on comparative genomic hybridization, DNA sequencing
instrumentation, and DNA analysis technology. Eventually,
commercialization of these underlying technologies is expected
to generate hundreds of thousands of jobs. [800/287-3863,
Fax: 301/926-9524, atp@micf.nist.gov, http://www.atp.nist.gov]


Patenting  and 
Licensing  Highlights, 
FY  1994-96 

• A development license for single-molecule DNA sequencing
replaced the 1991-94 CRADA (the first CRADA to be established
in the U.S. Human Genome Project) between LANL and Life
Technologies, Inc. (LTI).

• In 1995, a broad patent was awarded to UC for chromosome
painting. This technology uses FISH to stain specific
locations in cells and chromosomes for diagnosing, imaging,
and studying chromosomal abnormalities and cancer. Resulting
from a 1989 CRADA between LLNL and UC, FISH was licensed
exclusively to Vysis.

• Hyseq, Inc., was founded in 1993 by former Argonne National
Laboratory researchers Radoje Drmanac and Radomir Crkvenjakov
to commercialize the sequencing by hybridization (SBH)
technology. Hyseq has exclusive patent rights to a variation
known as format 3 of SBH or the "super chip." Hyseq later won
an Advanced Technology Program award from the U.S. National
Institute of Standards and Technology to develop the
technology further.

• Oligomers (short, single-stranded DNAs) are crucial reagents
for genome research and biomedical diagnostics. ProtoGene
Laboratories, Inc., was founded to commercialize new DNA
synthesis technology (developed initially at LBNL with
completed prototypes at Stanford University) and to offer the
first lower-cost custom oligomer synthesis. The Parallel Array
Synthesis system, which independently synthesizes 96 oligomers
per run in a standard 96-well microtiter plate format, shows
great promise for significant cost reductions. ProtoGene first
licensed sales and distribution to LTI and, later, production
rights as well. LTI operates production centers in the United
States, Europe, and Japan.

• The GRAIL-genQuest sequence-analysis software developed at
ORNL was licensed by Martin Marietta Energy Systems (now
Lockheed Martin Energy Research) to ApoCom, Inc., for
pharmaceutical and biotechnology company researchers who
cannot use the Internet because of data-security concerns. The
public GRAIL-genQuest service remains freely available on the
Internet (see box, p. 17).

• In 1995, an exclusive license was granted to U.S.
Biochemical Corporation for a genetically engineered,
heat-stable, DNA-replicating enzyme with much-improved
sequencing properties. The enzyme was developed by Stanley
Tabor at Harvard University Medical School.

• In 1995, an advanced capillary array electrophoresis system
for sequencing DNA was patented by Iowa State University. The
system was licensed to Premier American Technologies
Corporation for commercialization (see graphic at right and
R&D 100 Awards, next page).

• In 1996, a patent was granted to LANL researchers for DNA
fragment sizing and sorting by laser-induced fluorescence. An
exclusive license was awarded to Molecular Technology, Inc.,
for commercialization of the single-molecule detection
capability related to DNA sizing (see R&D 100 Awards, next
page).

SBIR  and  STTR 

Small Business Innovation Research (SBIR) Program awards are
designed to stimulate commercialization of new technology for
the benefit of both the private and public sectors. The highly
competitive program emphasizes cutting-edge, high-risk
research with potential for high payoff in different areas,
including human genome research. Small business firms with
fewer than 500 employees are invited to submit applications.
SBIR human genome topics concentrate on innovative and
experimental approaches for carrying out the goals of the
Human Genome Project (see SBIR, p. 63, in Part 2 of this
report). The Small Business Technology Transfer (STTR) Program
fosters transfers between research institutions and small
businesses. [DOE SBIR and STTR contact: Kay Etzler
(301/903-5867, Fax: -5488, Kay.Etzler@oer.doe.gov);
http://sbir.er.doe.gov/sbir, http://sttr.er.doe.gov/sttr]


Capillary Array Electrophoresis (CAE). CAE systems promise
dramatically faster and higher-resolution fragment separation
for DNA sequencing. A multiplexed CAE system designed by
Edward Yeung (Iowa State University) has been developed for
commercial production by Premier American Technologies
Corporation (PATCO). In the PATCO ESY9600 model, DNA samples
are introduced into the 96-capillary array; as the separated
fragments pass through the capillaries, they are irradiated
all at once with laser light. Fluorescence is measured by a
charge-coupled device that acts as a simultaneous multichannel
detector. (Inset circle at upper left: Closeup view of
individual capillary lanes with separated samples.) Because
every fragment length exists in the sample, bases are
identified in order according to the time required for them to
reach the laser-detector region.
iSoune:  Thamas  Kiir.t;.  PATCOJ 
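
The read-out principle described in this caption can be
illustrated with a short Python sketch. It is a simplified
illustration only and does not represent the PATCO instrument
software: the function name call_bases and the peak values are
hypothetical, and the sketch assumes each detected peak in one
capillary has already been reduced to an arrival time and the
base indicated by that fragment's fluorescent label.

```python
from typing import List, Tuple

def call_bases(peaks: List[Tuple[float, str]]) -> str:
    """Read a sequence from one capillary.

    Each peak is (arrival_time, base). Shorter fragments migrate
    faster, so sorting by arrival time puts the labeled terminal
    bases in sequence order.
    """
    ordered = sorted(peaks, key=lambda peak: peak[0])
    return "".join(base for _, base in ordered)

# Hypothetical peaks (seconds, base), deliberately out of order.
example_peaks = [(312.4, "G"), (301.9, "A"), (334.0, "T"), (323.1, "C")]
read = call_bases(example_peaks)
```

In a 96-capillary array the same ordering step would be applied
to each capillary independently, since every lane is detected
at the same time.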
Technology Transfer
Award

A Federal Laboratory Consortium Award for Excellence in
Technology Transfer was presented to Edward Yeung and a
research team at Iowa State University's Ames Laboratory in
1993. Their laser-based method for indirect fluorescence of
biological samples may have applications for routine
high-speed DNA sequencing (see graphic, p. 23). Yeung also won
the 1994 American Chemical Society Award for Analytical
Chemistry.


1997  R&D  100  Awards 

DOE researchers in 12 facilities across the country won 36 of
the R&D 100 Awards given by Research and Development Magazine
for 1996 work. DOE award-winning research ranged from advances
in supercomputing to the biological recycling of tires.
Announced in July 1997, these awards bring DOE's R&D 100 total
to 453, the most of any single organization and twice as many
as all other government agencies combined.


Two DOE genome-related research projects received 1997 R&D 100
Awards. One was to Yeung (see text at left and graphic, p. 23)
for "ESY9600 Multiplexed Capillary Electrophoresis DNA
Sequencer."

The other award was to Richard Keller and James Jett (LANL)
with Amy Gardner (Molecular Technologies, Inc.) for
"Rapid-Size Analysis of Individual DNA Fragments." This
technology speeds determination of DNA fragment sizes, making
DNA fingerprinting applications in biotechnology and other
fields more reliable and practical.

R&D Magazine began making annual awards in 1963 to recognize
the 100 most significant new technologies, products,
processes, and materials developed throughout the world during
the previous year (http://www.rdmag.com/rd100/100award.htm).
Winners are chosen by the magazine's editors and a panel of 75
respected scientific experts in a variety of disciplines.
Previous winners of R&D 100 Awards include such well-known
products as the flashcube (1965), antilock brakes (1969),
automated teller machine (1973), fax machine (1975), digital
compact cassette (1993), and Taxol anticancer drug (1993).
Joint Genome Institute

DOE  Merges  Sequencing  Efforts  of  Genome  Centers 


Elbert Branscomb
JGI Scientific Director
Lawrence Livermore National Laboratory
7000 East Avenue, L-452
Livermore, CA 94551
510/422-5681
elbert@alu.llnl.gov or
elbert@shotgun.llnl.gov


In a major restructuring of its Human Genome Program, on
October 23, 1996, the DOE Office of Biological and
Environmental Research established the Joint Genome Institute
(JGI) to integrate work based at its three major human genome
centers.

The JGI merger represents a shift toward large-scale
sequencing via intensified collaborations for more effective
use of the unique expertise and resources at Lawrence Berkeley
National Laboratory (LBNL), Lawrence Livermore National
Laboratory (LLNL), and Los Alamos National Laboratory (see
Research Narratives, beginning on p. 27 in this report).
Elbert Branscomb (LLNL) serves as JGI's Scientific Director.
Capital equipment has been ordered, and operational support of
about $30 million is projected for the 1998 fiscal year.


Production  DNA  Sequencing  Begun  Worldwide 


The year 1996 marked a transition to the final and most
challenging phase of the U.S. Human Genome Project, as pilot
programs aimed at refining large-scale sequencing strategies
and resources were funded by DOE and NIH (see Research
Highlights, DNA Sequencing, p. 14). Internationally,
large-scale human genome sequencing was kicked off in late
1995 when The Wellcome Trust announced a 7-year, $75-million
grant to the private Sanger Centre to scale up its sequencing
capabilities. French investigators also have announced
intentions to begin production sequencing.

Funding agencies worldwide agree that rapid and free release
of data is critical. Other issues include sequence accuracy,
types of annotation that will be most useful to biologists,
and how to sustain the reference sequence.

The international Human Genome Organisation maintains a Web
page to provide information on current and future sequencing
projects and links to sites of participating groups
(http://hugo.gdb.org). The site also links to reports and
resources developed at the February 1996 and 1997 Bermuda
meetings on large-scale human genome sequencing, which were
sponsored by The Wellcome Trust.


With easy access to both LBNL and LLNL, a building in Walnut
Creek, California, is being modified. Here, starting in late
FY 1998, production DNA sequencing will be carried out for
JGI. Until that time, large-scale sequencing will continue at
LANL, LBNL, and LLNL. Expectations are that within 3 to 4
years the Production Sequencing Facility will house some 200
researchers and technicians working on high-throughput DNA
sequencing using state-of-the-art robotics.

Initial plans are to target gene-rich regions of around 1 to
10 megabases for sequencing. Considerations include gene
density, gene families (especially clustered families),
correlations to model organism results, technical
capabilities, and relevance to the DOE mission (e.g., DNA
repair, cancer susceptibility, and impact of genotoxins). The
JGI program is subject to regular peer review.

Sequence data will be posted daily on the Web; as the
information progresses to finished quality, it will be
submitted to public databases.

As JGI and other investigators involved in the Human Genome
Project are beginning to reveal the DNA sequence of the
3 billion base pairs in a reference human genome, the data
already are becoming valuable reagents for explorations of DNA
sequence function in the body, sometimes called "functional
genomics." Although large-scale sequencing is JGI's major
focus, another important goal will be to enrich the sequence
data with information about its biological function. One
measure of JGI's progress will be its success at working with
other DOE laboratories, genome centers, and non-DOE academic
and industrial collaborators. In this way, JGI's evolving
capabilities can both serve and benefit from the widest array
of partners.
Research Narratives

Lawrence Livermore National Laboratory Human Genome Center


The Human Genome Center at Lawrence Livermore National
Laboratory (LLNL) was established by DOE in 1991. The center
operates as a multidisciplinary team whose broad goal is
understanding human genetic material. It brings together
chemists, biologists, molecular biologists, physicists,
mathematicians, computer scientists, and engineers in an
interactive research environment focused on mapping, DNA
sequencing, and characterizing the human genome.

Goals  and  Priorities 

In the past 2 years, the center's goals have undergone an
exciting evolution. This change is the result of several
factors, both intrinsic and extrinsic to the Human Genome
Project. They include (1) successful completion of the
center's first-phase goal, namely a high-resolution,
sequence-ready map of human chromosome 19; (2) advances in DNA
sequencing that allow accelerated scaleup of this operation;
and (3) development of a strategic plan for LLNL's Biology and
Biotechnology Research Program that will integrate the
center's resources and strengths in genomics with programs in
structural biology, individual susceptibility, medical
biotechnology, and microbial biotechnology.

The primary goal of LLNL's Human Genome Center is to
characterize the mammalian genome at optimal resolution and to
provide information and material resources to other in-house
or collaborative projects that allow exploitation of genomic
biology in a synergistic manner. DNA sequence information
provides the biological driver for the center's priorities:

• Generation of highly accurate sequence for chromosome 19.

• Generation of highly accurate sequence for genomic regions
of high biological interest to the mission of the DOE Office
of Biological and Environmental Research (e.g., genes involved
in DNA repair, replication, recombination, xenobiotic
metabolism, and cell-cycle control).

• Isolation and sequence of the full insert of cDNA clones
associated with genomic regions being sequenced.

• Sequence of selected corresponding regions of the mouse
genome in parallel with the human.

• Annotation and position of the sequenced clones with
physical landmarks such as linkage markers and sequence tagged
sites (STSs).

• Generation of mapped chromosome 19 and other genomic clones
[cosmids, bacterial artificial chromosomes (BACs), and P1
artificial chromosomes (PACs)] for collaborating groups.

• Sharing of technology with other groups to minimize
duplication of effort.

• Support of downstream biology projects, for example,
structural biology, functional studies, human variation,
transgenics, medical biotechnology, and microbial
biotechnology with know-how, technology, and material
resources.

Center  Organization 
and  Activities 

Completion and publication of the metric physical map of human
chromosome 19 (see p. 28) in 1995 has led to consolidation of
many functions associated with physical mapping, with
increased emphasis on DNA sequencing. The center is organized
into five broad areas of research and support: sequencing,
resources, functional genomics, informatics and analytical
genomics, and instrumentation. Each area consists of multiple
projects, and extensive interaction occurs both within and
among projects.


Human Genome Center
Lawrence Livermore National Laboratory
Biology and Biotechnology Research Program
7000 East Avenue, L-452
Livermore, CA 94551
Anthony V. Carrano
Director
510/422-5698, Fax: /423-3110
carrano1@llnl.gov

Linda Ashworth
Assistant to Center Director
5I0.''422-56A5.  Fax:  -2282 
ashworth1@llnl.gov


In lieu of individual abstracts, research projects and
investigators at the LLNL Human Genome Center are represented
in this narrative. More information can be found on the
center's Web site (see URL above).


Update 


In 1997 Lawrence Berkeley National Laboratory, Lawrence
Livermore National Laboratory, and Los Alamos National
Laboratory began collaborating in a Joint Genome Institute to
implement high-throughput sequencing [see p. 26 and Human
Genome News 8(2), 1-2].
Legend

In the column labeled cosmid clones, black indicates a
FISH-ordered clone where distance between clones has been
measured. Other cosmids are shown in red. Genes are in red to
the left of the metric scale. Other markers are labeled in
black. A disease associated with a specific gene is shown in
blue to the right of the metric scale.

Restriction-mapped contig

BAC, PAC, or P1 clone

YAC with known and concordant size

YAC with unknown or discordant size

+ Sequence tagged site (STS)

STS and/or hybridization results

§ Polymorphic marker

Chromosome 19 Map. In the current map (at left) of the first
2 million bases at the p-telomere end of chromosome 19, the
EcoR I restriction-mapped contigs (represented by red lines)
provide the starting material for genomic sequencing across a
region.


[Map column labels: cosmid clones; genes (red) and markers
(black); metric scale, 2.0 Mb]


Sequencing 

The sequencing group is divided into several subprojects. The
core team is responsible for the construction of sequence
libraries, sequencing reactions, and data collection for all
templates in the random phase of sequencing. The finishing
team works with data produced by the core team to produce
highly redundant, highly accurate "finish" sequence on targets
of interest. Finally, a team of researchers focuses
specifically on development, testing, and implementation of
new protocols


Construction of the human chromosome 19 physical map was based
on a similar strategy for mapping the roundworm Caenorhabditis
elegans. View the complete map on the World Wide Web
(http://www-bio.llnl.gov/genome/html/chrom_map.html).

[Source: Adapted from figure provided by Linda Ashworth, LLNL]

for the entire group, with an emphasis on improving the
efficiency and cost basis of the sequencing operation.

Resources 

The resources group provides mapped clonal resources to the
sequencing teams. This group performs physical mapping as
needed for the DNA sequencing group by using fingerprinting,
restriction mapping, fluorescence in situ hybridization, and
other techniques. A small mapping effort is under way to
identify, isolate, and characterize BAC
Putative-Gene Classification. The figure depicts the
functional classification of putative genes identified in a
1.02-Mb region on the long arm of human chromosome 19.
Analysis of the completed sequence between markers D19S208 and
COX7A1 revealed -tS open reading frames (ORFs) or putative
genes. (An ORF is a DNA region containing specific sequences
that signal the beginning and ending of a gene.)

Thirty of these putative genes were found to have sequence
similarities to a wide variety of known genes or proteins,
including some involved in transcription, cell adhesion and
signaling, and metabolism. Many appear to be related
functionally to such known proteins as the GTPase-activating
proteins or the ETS family of transcription factors. Others
seem to be new members of existing gene families, for example,
the mRNA splicing factor, or of such pseudogenes as the
elongation factor Tu.

In addition to those that could be classified, l^ novel genes
were identified, including one with high similarity to a
predicted ORF of unknown function in the roundworm
Caenorhabditis elegans. [Source: Adapted from graph provided
by Linda Ashworth, LLNL]
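
The parenthetical definition of an ORF in this caption can be
made concrete with a minimal Python sketch. It is an
illustration under simplifying assumptions, not the method used
at LLNL: it scans only the three forward reading frames, treats
ATG as the start signal and TAA, TAG, or TGA as the end signal,
and ignores introns, so it would not by itself find genes in
human genomic DNA. The function name find_orfs and the
min_codons parameter are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) positions of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

# Hypothetical 13-base fragment containing one short ORF.
example = "CCATGAAATAGGG"
found = find_orfs(example)
```

For the example fragment, the only reported interval covers the
substring ATGAAATAG, which begins with a start codon and ends
with a stop codon in the same frame.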


clones (from anywhere in the human genome) that relate to
susceptibility genes, for example, DNA repair. These clones
will be characterized and provided for sequencing and at the
same time contribute to understanding the biology of the
chromosome, the genome, and susceptibility factors. The
mapping team also collaborates with others using the
chromosome 19 map as a resource for gene hunting.

Functional  Genomics 

The functional genomics team is responsible for assembling and
characterizing clones for the Integrated Molecular Analysis of
Gene Expression (called IMAGE) Consortium and cDNA sequencing,
as well as for work on gene expression and comparative mouse
genomics. The effort emphasizes genes involved in DNA repair
and links strongly to LLNL's gene-expression and structural
biology efforts. In addition, this team is working closely
with Oak Ridge National Laboratory (ORNL) to develop a
comparative map and the sequence data for mouse regions
syntenic to human chromosome 19 (see p. 32).

Informatics  and  Analytical 
Genomics 

The informatics and analytical genomics group provides
computer science support to biologists. The sequencing
informatics team works directly with the DNA sequencing group
to facilitate and automate sample handling, data acquisition
and storage, and DNA sequence analysis and annotation. The
analytical genomics team provides statistical and advanced
algorithmic expertise. Tasks include development of
model-based methods for data capture, signal processing, and
feature extraction for DNA sequence and fingerprinting data
and analysis of the effectiveness of newly proposed methods
for sequencing and mapping.

Instrumentation 

The instrumentation group also has multiple components. Group
members provide expertise in instrumentation and automation in
high-throughput electrophoresis, preparation of high-density
replicate DNA and colony filters, fluorescence labeling
technologies, and automated sample handling for DNA
sequencing. To facilitate seamless integration of new
technologies into production use, this group is coupled
tightly to the biologist user groups and the informatics
group.

Collaborations 

The center interacts extensively with other efforts within the
LLNL Biology and Biotechnology Research Program and with other
programs at LLNL, the academic community, other research
institutes, and industry. More than 250 collaborations range
from simple probe and clone sharing to detailed gene family
studies. The following list reflects some major
collaborations.

• Integration of the genetic map of human chromosome 19 with
corresponding mouse chromosomes (ORNL).

• Miniaturized polymerase chain reaction instrumentation
(LLNL).

• Sequencing of IMAGE Consortium cDNA clones (Washington
University, St. Louis).

• Mapping and sequencing of a gene associated with Finnish
congenital nephrotic syndrome (University of Oulu, Finland).
Accomplishments

The LLNL Human Genome Center has excelled in several areas,
including comparative genomic sequencing of DNA repair genes
in human and rodent species, construction of a metric physical
map of human chromosome 19, and development and application of
new biochemical and mathematical approaches for constructing
ordered clone maps. These and other major accomplishments are
highlighted below.

• Completion of highly accurate sequencing totaling
1.6 million bases of DNA, including regions spanning human DNA
repair genes, the candidate region for a congenital kidney
disease gene, and other regions of biological interest on
chromosome 19.

• Completion of comparative sequence analysis of 107,500 bases
of genomic DNA encompassing the human DNA repair gene ERCC2
and the corresponding regions in mouse and hamster (p. 32). In
addition to ERCC2, analysis revealed the presence of two
previously undescribed genes in all three species. One of
these genes is a new member of the kinesin motor protein
family. These proteins play a wide variety of roles in the
cell, including movement of chromosomes before cell division.

• Complete sequencing of human genomic regions containing two
additional DNA repair genes. One of these, XRCC3, maps to
human chromosome 14 and encodes a protein that may be required
for chromosome stability. Analysis of the genomic sequence
identified another kinesin motor protein gene physically
linked to XRCC3. The second human repair gene, HHR23A, maps to
19p13.2. Sequence analysis of 110,000 bases containing HHR23A
identified six other genes, five of which are new genes with
similarity to proteins from mouse, human, yeast, and
Caenorhabditis elegans.

• Complete sequencing of full-length cDNAs for three new DNA
repair genes (XRCC2, XRCC3, and XRCC9) in collaboration with
the LLNL DNA repair group.

• Generation of a metric physical map of chromosome 19
spanning at least 95% of the chromosome. This unique map
incorporates a metric scale to estimate the distance between
genes or other markers of interest to the genetics community.

• Assembly of nearly 45 million bases of EcoR I
restriction-mapped cosmid contigs for human chromosome 19
using a combination of fingerprinting and cosmid walking.
Small gaps in cosmid continuity have been spanned by BAC, PAC,
and P1 clones, which are then integrated into the restriction
maps. The high depth of coverage of these maps (average
redundancy, 4.3-fold) permits selection of a minimum
overlapping set of clones for DNA sequencing.

• Placement of more than 400 genes, genetic markers, and other
loci on the chromosome 19 cosmid map. Also, 165 new STSs
associated with premapped cosmid contigs were generated and
added to the physical map.

• Collaborations to identify the gene (COMP) responsible for
two allelic genetic diseases, pseudoachondroplasia and
multiple epiphyseal dysplasia, and the identification of
specific mutations causing each condition.

• Through sequence analysis of the 2A subfamily of the human
cytochrome P450 enzymes, identification of a new variant that
exists in 10% to 20% of individuals and results in reduced
ability to metabolize nicotine and the antiblood-clotting drug
Coumadin.


• Location of a zinc finger gene that encodes a transcription
factor regulating blood-cell development adjacent to telomere
repeat sequences, possibly the gene nearest one end of
chromosome 19.

• Completion of the genomic and cDNA sequence of the gene for
the human Rieske Fe-S protein involved in mitochondrial
respiration.

• Expansion of the mouse-human comparative genomics
collaboration with ORNL to include study of new groups of
clustered transcription factors found on human chromosome 19q
and as syntenic homologs on mouse chromosome 7 (p. 32).

•  Numerous  collaborations  (in  particu- 
lar, with  Washington  University  and 
Merck)  continuing  to  expand  the 
LLNL-based  IMAGE  Consortium, 
an  effort  to  characterize  the  tran- 
scribed human  genome.  The  IMAGE 
clone  collection  is  now  the  largest 
public  collection  of  sequenced  cDNA 
clones,  with  more  than  one  million 
arrayed  clones,  800,000  sequences  in 
public  databases,  and  10,000  mapped 
cDNAs. 

•  Development  and  deployment  of  a 
comprehensive  system  to  handle 
sample  tracking  needs  of  production 
DNA  sequencing.  The  system  com- 
bines databases  and  graphical  inter- 
faces running  on  both  Mac  and  Sun 
platforms  and  scales  easily  to  handle 
large-scale  production  sequencing. 

•  Expansion  of  the  LLNL  genome 
center's  World  Wide  Web  site  to  in- 
clude tables  that  link  to  each  gene  be- 
ing sequenced,  to  the  quality  scores 
and  assembled  bases  collected  each 
night  during  the  sequencing  process, 
and  to  the  submitted  GenBank  se- 
quence when  a  clone  is  completed. 
[http://bbrp.llnl.gov/test-bin/projqcsummary]

• Implementation of a new database to
support  sequencing  and  mapping 
work  on  multiple  chromosomes  and 
species.  Web-based  automated  tools 
were  developed  to  facilitate  construc- 
tion of  this  database,  the  loading  of 
over  100  million  bytes  of  chromosome 
19  data  from  the  existing  LLNL  data- 
base, and  automated  generation  of 
Web-based  input  interfaces. 

•  Significant  enhancement  of  the 
LLNL  Genome  Graphical  Database 
Browser  software  to  display  and  link 
information  obtained  at  a  subcosmid 
resolution  from  both  restriction  map 
hybridization  and  sequence  feature 
data.  Features,  such  as  genes  linked 
to diseases, allow tracking to fragments
as small as 500 base pairs of
DNA. 

•  Development  of  advanced  micro- 
fabrication  technologies  to  produce 
electrophoresis  microchannels  in 
large  glass  substrates  for  use  in  DNA 
sequencing. 

• Installation of a new filter-spotting
robot that routinely produces 6 x 6
x 384 filters. A 16 x 16 x 384 pattern
has been achieved.

•  Upgrade  of  the  Lawrence  Berkeley 
National  Laboratory  colony  picker 
using  a  second  computer  so  that  im- 
aging and  picking  can  occur  simulta- 
neously. 
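
The "minimum overlapping set of clones" mentioned in the chromosome 19 cosmid assembly item above can be illustrated with a greedy interval-cover sketch. This is not the LLNL procedure, which the narrative does not describe. The function name `minimal_tiling_path`, the clone names, and the coordinates are hypothetical, and the clones are assumed to be already placed on a common coordinate axis by the restriction map.

```python
def minimal_tiling_path(clones, start, end):
    """Greedy interval cover: pick the fewest clones that span [start, end].

    clones: list of (name, left, right) tuples with left < right.
    Returns the chosen clone names in map order, or None if a gap exists.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = start
    i = 0
    while covered_to < end:
        best = None
        # Among clones that begin at or before the covered edge,
        # keep the one that reaches farthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            return None  # no clone extends coverage: a gap in the contig
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical cosmids (name, left, right) in kilobases on one contig.
cosmids = [
    ("c1", 0, 40), ("c2", 10, 45), ("c3", 30, 75),
    ("c4", 35, 70), ("c5", 60, 100), ("c6", 72, 110),
]
print(minimal_tiling_path(cosmids, 0, 110))
```

With redundant coverage, many clones overlap each position, so the greedy choice can skip most of them; where coverage is thin, the function reports a gap instead of a path.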

Future  Plans 

Genomic  sequencing  currently  is  the 
dominant  function  of  Livermore's  Hu- 
man Genome  Center.  The  physical  map- 
ping effort  will  ensure  an  ample  supply 
of  sequence-ready  clones.  For  sequenc- 
ing targets  on  chromosome  19,  this  in- 
cludes ensuring  that  the  most  stable 
clones  (cosmids,  BACs,  and  PACs)  are 
available for sequencing and that regions
with such known physical landmarks
as STSs and expressed sequence
tags (ESTs) are annotated to facilitate
sequence assembly and analysis. The
following targets are emphasized for
DNA sequencing:

•  Regions  of  high  gene  density,  includ- 
ing regions  containing  gene  families. 

•  Chromosome  19,  of  which  at  least  42 
million  bases  are  sequence  ready. 

•  Selected  BAC  and  PAC  clones  repre- 
senting regions  of  about  0.2  million 
to  1  million  bases  throughout  the 
human  genome;  clones  would  be 
selected  based  on  such  high-priority 
biological  targets  as  genes  involved 
in  DNA  repair,  replication,  recombi- 
nation, xenobiotic  metabolism,  cell- 
cycle  checkpoints,  or  other  specific 
targets  of  interest. 

•  Selected  BAC  and  PAC  clones  from 
mouse  regions  syntenic  with  the 
genes  indicated  above. 

•  Full-insert  cDNAs  corresponding  to 
the  genomic  DNA  being  sequenced. 

The  informatics  team  is  continuing  to 
deploy  broader-based  supporting  data- 
bases for  both  mapping  and  sequencing. 
Where  appropriate,  Web-  and  Java-based 
tools  are  being  developed  to  enable  bi- 
ologists to  interact  with  data.  Recent  re- 
organization within  this  group  enables 
better  direct  support  to  the  sequencing 
group,  including  evaluating  and  inter- 
facing sequence-assembly  algorithms 
and  analysis  tools,  data  and  process 
tracking,  and  other  informatics  func- 
tions that  will  streamline  the  sequencing 
process. 

The  instrumentation  effort  has  three 
major  thrusts:  (1)  continued  develop- 
ment or  implementation  of  laboratory 
automation to support high-throughput
sequencing; (2) development of the
next-generation DNA sequencer; and
(3) development of robotics to support
high-density  BAC  clone  screening.  The 
last  two  goals  warrant  further  expla- 
nation. 

The  new  DNA  sequencer  being  devel- 
oped under  a  grant  from  the  National 
Institutes  of  Health,  with  minor  support 
through the DOE genome center, is designed
to run 384 lanes simultaneously
with a low-viscosity sieving medium.
The  entire  system  would  be  loaded  au- 
tomatically, run,  and  set  up  for  the  next 
run  at  3-hour  intervals.  If  successful,  it 
should  provide  a  20-  to  40-fold  increase 
in  throughput  over  existing  machines. 
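
As a rough check on the throughput claim, the lane count and run interval given above can be turned into lanes per day. This is only a sketch: the narrative does not state the duty cycle or the throughput of existing machines, so round-the-clock operation is an assumption, and the "implied baseline" is simply the new figure divided by the stated 20- to 40-fold gain. The function name `lanes_per_day` is illustrative.

```python
def lanes_per_day(lanes_per_run, hours_per_run, hours_per_day=24.0):
    """Lanes completed per day if runs start back to back."""
    runs = hours_per_day / hours_per_run
    return lanes_per_run * runs


# 384 lanes per run, one run every 3 hours (figures from the text).
new_machine = lanes_per_day(384, 3)
print(new_machine)

# Baseline implied by a 20- to 40-fold increase (an inference, not a
# figure given in the narrative).
for gain in (20, 40):
    print(gain, new_machine / gain)
```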

An LLNL-designed high-precision spotting
robot, which should allow a density
of 98,304 spots in 96 cm², is now operating.
The goal of this effort is to create
high-density filters representing a 10x
BAC coverage of both human and
mouse genomes (30,000 clones = 1x
coverage). Thus each filter would provide
~3x coverage, and eight such filters
would  provide  the  desired  coverage  for 
both  genomes.  The  filters  would  be  hy- 
bridized with  amplicons  from  individual 
or  region-specific  cDNAs  and  ESTs; 
given  the  density  of  the  BAC  libraries, 
clones  that  hybridize  should  represent  a 
binned  set  of  BACs  for  a  region  of  in- 
terest. These  BACs  could  be  the  initial 
substrate  for  a  BAC  sequencing  strategy. 
Performing  hybridizations  in  parallel  in 
mouse  and  human  DNA  facilitates  the 
development  of  the  mouse  map  (with 
ORNL  involvement),  and  sequencing 


BACs  from  both  species  identifies 
evolutionarily  conserved  and,  perhaps, 
regulatory  regions. 
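
The filter arithmetic in this paragraph can be checked directly. The sketch below uses only the figures given in the text (98,304 spots per filter, 30,000 clones per 1x coverage, a 10x target for each of two genomes). It assumes one clone per spot and that each filter carries clones from a single genome; neither detail is stated in the narrative.

```python
import math

SPOTS_PER_FILTER = 98_304   # from the text; equals 16 * 16 * 384
CLONES_PER_1X = 30_000      # from the text: 30,000 clones = 1x coverage
TARGET_COVERAGE = 10        # 10x BAC coverage per genome
GENOMES = 2                 # human and mouse

# Coverage contributed by one full filter (assumes one clone per spot).
coverage_per_filter = SPOTS_PER_FILTER / CLONES_PER_1X

# Whole filters needed to reach the target for one genome, then for both.
filters_per_genome = math.ceil(TARGET_COVERAGE / coverage_per_filter)
total_filters = filters_per_genome * GENOMES

print(round(coverage_per_filter, 2))
print(filters_per_genome, total_filters)
```

Under these assumptions each filter gives a little over 3x coverage, four filters are needed per genome, and eight cover both, which agrees with the figures quoted above.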

Information  generated  by  sequencing 
human and mouse DNA in parallel is
expected  to  expand  LLNL  efforts  in 
functional  genomics.  Comparative  se- 
quence data  will  be  used  to  develop  a 
high-resolution  synteny  map  of  con- 
served mouse-human  domains  and 
incorporate  automated  northern  ex- 
pression analysis  of  newly  identified 
genes.  Long  range,  the  center  hopes  to 
take  advantage  of  a  variety  of  forms  of 
expression  analysis,  including  site- 
directed  mutation  analysis  in  the  mouse. 

Summary 

The  Livermore  Human  Genome  Center 
has  undergone  a  dramatic  shift  in  empha- 
sis toward  commitment  to  large-scale, 
high-accuracy  sequencing  of  chromo- 
some 19,  other  chromosomes,  and  tar- 
geted genomic  regions  in  the  human 
and  mouse.  The  center  also  is  commit- 
ted to  exploiting  sequence  information 
for  functional  genomics  studies  and  for 
other  programs,  both  in  house  and 
collaboratively. 
Research Narratives

Los  Alamos  National  Laboratory  Center  for  Human  Genome  Studies 


Biological  research  was  ini- 
tiated at  Los  Alamos  Na- 
tional Laboratory  (LANL) 
in  the  1940s,  when  the 
laboratory  began  to  inves- 
tigate the  physiological  and  genetic 
consequences  of  radiation  exposure. 
Eventual  establishment  of  the  national 
genetic  sequence  databank  called 
GenBank, the National Flow Cytometry
Resource,  numerous  related  individual 
research  projects,  and  fulfillment  of  a  key 
role  in  the  National  Laboratory  Gene  Li- 
brary Project  all  contributed  to  LANL's  se- 
lection as  the  site  for  the  Center  for 
Human  Genome  Studies  in  1988. 

Center  Organization 
and  Activities 

The  LANL  genome  center  is  organized 
into  four  broad  areas  of  research  and  sup- 
port: Physical  Mapping,  DNA  Sequenc- 
ing, Technology  Development,  and 
Biological  Interfaces.  Each  area  consists 
of  a  variety  of  projects,  and  work  is  dis- 
tributed among  five  LANL  Divisions 
(Life Sciences; Theoretical; Computing,
Information,  and  Communications; 
Chemical  Science  and  Technology;  and 
Engineering  Sciences  and  Applications). 
Extensive  interdisciplinary  interactions 
are encouraged.

Physical  Mapping 

The  construction  of  chromosome-  and 
region-specific  cosmid,  bacterial  artifi- 
cial chromosome  (BAC),  and  yeast  artifi- 
cial chromosome  (YAC)  recombinant 
DNA  libraries  is  a  primary  focus  of 
physical  mapping  activities  at  LANL. 
Specific  work  includes  the  construction 
of  high-resolution  maps  of  human  chro- 
mosomes 5  and  16  and  associated 
informatics  and  gene  discovery  tasks. 

Accomplishments

• Completion of an integrated physical
map of human chromosome 16 consisting
of both a low-resolution YAC
contig map and a high-resolution
cosmid contig map (pp. 37-39).
With sequence tagged site (STS)
markers provided on average every
125,000 bases, the YAC-STS map
provides almost-complete coverage
of the chromosome's euchromatic
arms. All available loci continue to
be incorporated into the map.

•  Construction  of  a  low-resolution  STS 
map  of  human  chromosome  5  con- 
sisting of  517  STS  markers  region- 
ally assigned  by  somatic-cell  hybrid 
approaches.  Around  95%  mega- 
YAC-STS  coverage  (50  million 
bases)  of  5p  has  been  achieved.  Ad- 
ditionally, about  40  million  bases  of 
5q  mega- YAC-STS  coverage  have 
been  obtained  collaboratively. 

•  Refinement  of  BAC  cloning  proce- 
dures for  future  production  of 
chromosome-specific  libraries. 
Successful  partial  digestion  and  clon- 
ing of  microgram  quantities  of  chro- 
mosomal DNA  embedded  in  agarose 
plugs. Efforts continue to increase
the average insert size to about
100,000 bases.

DNA  Sequencing 

DNA  sequencing  at  the  LANL  center 
focuses  on  low-pass  sample  sequencing 
(SASE) of large genomic regions. SASE
data  is  deposited  in  publicly  available 
databases  to  allow  for  wide  distribution. 
Finished  sequencing  is  prioritized  from 
initial SASE analysis and pursued by parallel
primer walking. Informatics development
includes data tracking, gene-discovery
integration with the Sequence
Comparison  ANalysis  (SCAN)  program, 
and  functional  genomics  interaction. 

Accomplishments 

•  SASE  sequencing  of  1.5  million 
bases from the p13 region of human
chromosome  16. 

•  Discovery  of  more  than  100  genes  in 
SASE  sequences. 


Center for Human Genome
Studies
Los Alamos National Laboratory
P.O. Box 1663
Los Alamos, NM 87545

Larry L. Deaven
Acting Director
505/667-3912, Fax: -2891
ldeaven@telomere.lanl.gov

Lynn Clark
Technical Coordinator
505/667-9376, Fax: -2891
clark@telomere.lanl.gov

Robert K. Moyzis
Director.  1V8S-W* 


In  lieu  of  individual  abstracts, 
research  projects  and  investi- 
gators at  the  LANL  Center  for 
Human  Genome  Studies  are 
represented  in  this  narrative. 
More information can be found
on  the  center's  Web  site  (see 
URL  above). 


Update 


In  1997  Lawrence  Berkeley  Na- 
tional Laboratory,  Lawrence 
Livermore National Laboratory,
and  Los  Alamos  National  Labora- 
tory began  collaborating  in  a  Joint 
Genome Institute to implement
high-throughput sequencing [see
p. 26 and Human Genome News
8(2), 1-2].


*Now at University of California,
Irvine
• Generation of finished sequence
for  a  240,000-base  telomeric  re- 
gion of  human  chromosome  7q. 
From  initial  sequences  generated 
by  SASE,  oligonucleotides  were 
synthesized  and  used  for  primer 
walking  directly  from  cosmids 
comprising  the  contig  map.  Com- 
plete sequencing  was  performed  to 
determine  what  genes,  if  any,  are 
near  the  7q  terminus.  This  intri- 
guing region  lacks  significant 
blocks  of  subtelomeric  repeat  DNA 
typically  present  near  eukaryotic 
telomeres. 

•  Complete  single-pass  sequencing  of 
2018  exon  clones  generated  from 
LANL's  flow-sorted  human  chromo- 
some 16  cosmid  library.  About  950 
discrete  sequences  were  identified  by 
sequence  analysis.  Nearly  800  appear 
to  represent  expressed  sequences 
from  chromosome  16. 

•  Development  of  Sequence  Viewer  to 
display  ABI  sequences  with  trace 
data  on  any  computer  having  an 
Internet  connection  and  a  Netscape 
World  Wide  Web  browser. 

•  Sequencing  and  analysis  of  a  novel 
pericentromeric  duplication  of  a 
gene-rich cluster between 16p11.1
and  Xq28  (in  collaboration  with 
Baylor  College  of  Medicine). 

Technology  Development 

Technology  development  encompasses 
a  variety  of  activities,  both  short  and 
long  term,  including  novel  vectors  for 
library  construction  and  physical  map- 
ping; automation  and  robotics  tools  for 
physical  mapping  and  sequencing; 
novel  approaches  to  DNA  sequencing 
involving  single-molecule  detection; 
and  novel  approaches  to  informatics 
tools  for  gene  identification. 


Accomplishments

•  Development  of  SCAN  program  for 
large-scale  sequence  analysis  and  an- 
notation, including  a  translator  con- 
verting SCAN  data  to  GIO  format  for 
submission  to  Genome  Sequence 
DataBase. 

• Application of flow-cytometric approach
to DNA sizing of P1 artificial
chromosome  (PAC)  clones.  Less  than 
one  picogram  of  linear  or  supercoiled 
DNA  is  analyzed  in  under  3  minutes. 
Sizing  range  has  been  extended 
down  to  287  base  pairs.  Efforts  con- 
tinue to  extend  the  upper  limit  be- 
yond 167,000  bases. 

•  Characterization  of  the  detection  of 
single,  fluorescently  tagged  nucleo- 
tides cleaved  from  multiple  DNA 
fragments  suspended  in  the  flow 
stream  of  a  flow  cytometer  (see  pic- 
ture, p.  70).  The  cleavage  rate  for 
Exo  III  at  37°C  was  measured  to  be 
about  5  base  pairs  per  second  per 
M13  DNA  fragment.  To  achieve  a 
single-color  sequencing  demonstra- 
tion, either  the  background  burst  rate 
(currently  about  5  bursts  per  second) 
must  be  reduced  or  the  exonuclease 
cleavage  rate  must  be  increased  sig- 
nificantly. Techniques  to  achieve 
both  are  being  explored. 

•  Construction  of  a  simple  and  com- 
pact apparatus,  based  on  a  diode- 
pumped  Nd:YAG  laser,  for  routine 
DNA  fragment  sizing. 

•  Development  of  a  new  approach  to 
detect  coding  sequences  in  DNA. 
This  complete  spectral  analysis  of 
coding  and  noncoding  sequences  is 
as  sensitive  in  its  first  implementa- 
tions as  the  best  existing  techniques. 

•  Use  of  phylogenetic  relationships  to 
generate  new  profiles  of  amino  acid 
usage  in  conserved  domains.  The 
profiles  are  particularly  useful  for 
classification  of  distantly  related 
sequences. 
Biological Interfaces

The  Biological  Interfaces  effort  targets 
genes  and  chromosome  regions  asso- 
ciated with  DNA  damage  and  repair, 
mitotic  stability,  and  chromosome  struc- 
ture and  function  as  primary  subjects 
for  physical  mapping  and  sequencing. 
Specific  disease-associated  genes  on 
human  chromosome  5  (e.g.,  Cri-du-Chat 
syndrome) and on 16 (e.g., Batten's disease
and Fanconi anemia) are the subjects
of collaborative biological
projects. 

Accomplishments 

•  Identification  of  two  human  7q  exons 
having  99%  homology  to  the  cDNA 
of  a  known  human  gene,  vasoactive 
intestinal  peptide  receptor  2A.  Pre- 
liminary data  suggests  that  the 
VIPR2A  gene  is  expressed. 

•  Identification  of  numerous  expressed 
sequence  tags  (ESTs)  localized  to  the 
7q  region.  Since  three  of  the  ESTs 
contain  at  least  two  regions  with  high 
confidence of homology (~90%),
genes  in  addition  to  VIPR2A  may 
exist  in  the  terminal  region  of  7q. 

•  Generation  of  high-resolution  cosmid 
coverage  on  human  chromosome  5p 
for  the  larynx  and  critical  regions 
identified  with  Cri-du-Chat  syndrome, 
the  most  common  human  terminal- 
deletion  syndrome  (in  collaboration 
with  Thomas  Jefferson  University). 

• Refinement of the Wolf-Hirschhorn
syndrome  (WHS)  critical  region  on 
human  chromosome  4p.  Using  the 
SCAN  program  to  identify  genes 
likely  to  contribute  to  WHS,  the 
project  serves  as  a  model  for  defining 
the  interaction  between  genomic  se- 
quencing and  clinical  research. 

• Collaborative construction of contigs
for human chromosome 16, including
1.05 million bases in cosmids
through the familial Mediterranean
fever (FMF) gene region (with
members of the FMF Consortium)
and 700,000 bases in P1 clones encompassing
the polycystic kidney
disease gene (with Integrated
Genetics, Inc.).

• Collaborative identification and determination
of the complete genomic
structure of the Batten's disease gene
(with  members  of  the  BDG  Consor- 
tium), the  gamma  subunit  of  the  hu- 
man amiloride-sensitive  epithelial 
channel  (Liddle's  syndrome,  with 
University  of  Iowa),  and  the  polycys- 
tic kidney  disease  gene  (with  Inte- 
grated Genetics). 

• Participation in an international collaborative
research consortium that
successfully  identified  the  gene  re- 
sponsible for  Fanconi  anemia  type  A. 


Chromosome 16 Physical Map (pp. 38-39). A condensed chromosome 16
physical map constructed at Los Alamos National Laboratory (LANL) is
shown in two parts on the following pages. Besides facilitating the isolation
and characterization of disease genes, the map provides the framework for
a large-scale sequencing effort by LANL, The Institute for Genomic
Research, and the Sanger Centre.

Distinct types of maps and data are shown as levels or tiers on the
integrated map. At the top of each page is a view of the banded human
chromosome to which the map is aligned. A somatic-cell hybrid breakpoint
map, which divides the chromosome into 90 intervals, was used as a
backbone for much of the map integration.

The physical map consists of both a low-resolution yeast artificial
chromosome (YAC) contig map localized to and ordered within the
breakpoint intervals with sequence tagged sites (STSs) and a high-
resolution bacteria-based clone map. The YAC-STS map provides almost
complete coverage of the chromosome's euchromatic arm, with STS
markers on average every 100,000 bases.

A high-resolution, sequence-ready cosmid contig map is anchored to the
YAC and breakpoint maps via STSs developed from cosmid contigs and by
hybridizations between YACs and cosmids.

As part of the ongoing effort to incorporate all available loci onto a single
map of this chromosome, the integrated map also features genes, expressed
sequence tags, exons (gene-coding regions), and genetic markers.

The mouse chromosome segments at the bottom of the map contain groups
that correspond to human genes mapped to the regions shown above them.

[Source: Norman Doggett, LANL]
The exhibit "Un..,

Science Museum in Los Alamos, New Mexico, describes the LANL Center
for Human Genome Studies' contributions to the Human Genome Project.
The exhibit's centerpiece is a 16-foot-long version of LANL's map of human
chromosome 16. [Source: LANL Center for Human Genome Studies]

Patents,  Licenses,  and 
CRADAs 

• Rhett L. Affleck, James N. Demas,
Peter M. Goodwin, Jay A. Schecker,
Ming Wu, and Richard A. Keller,
"Reduction of Diffusional Defocusing
in Hydrodynamically Focused Flows
by Complexing with a High Molecular
Weight Adduct," United States Patent,
filed December 1996.

• R.L. Affleck, W.P. Ambrose, J.D.
Demas, P.M. Goodwin, M.E. Johnson,
R.A. Keller, J.T. Petty, J.A. Schecker,
and M. Wu, "Photobleaching to Reduce
or Eliminate Luminescent Impurities
for Ultrasensitive Luminescence
Analysis," United States Patent, S-87,208,
accepted September 1997.

• J.H. Jett, M.L. Hammond,
R.A. Keller, B.L.
Marrone, and J.C. Martin,
"DNA Fragment Sizing
and Sorting by Laser-
Induced Fluorescence,"
United States Patent,
S.N. 75,001, allowed
May 1996.

• James H. Jett, "Method
for Rapid Base Sequencing
in DNA and RNA
with Three Base Labeling,"
in preparation.

• Development license and
exclusive license to
LANL's DNA sizing
patent obtained by Molecular
Technology, Inc.,
for commercialization of
single-molecule detection
capability to DNA sizing.

Future  Plans 


LANL  has  joined  a  collabo- 
ration with  California  Institute  of  Tech- 
nology and  The  Institute  for  Genomic 
Research  to  construct  a  BAC  map  of 
the p arm of human chromosome 16
and to complete the sequence of a 20-million-base
region of this map.

In  its  evolving  role  as  part  of  the  new 
DOE Joint Genome Institute, LANL
will continue scaleup activities focused
on high-throughput DNA sequencing.
Initial  targets  include  genes  and  DNA 
regions  associated  with  chromosome 
structure  and  function,  syntenic  break- 
points, and  relevant  disease-gene  loci. 

A  joint  DNA  sequencing  center  was  es- 
tablished recently  by  LANL  at  the  Uni- 
versity of  New  Mexico.  This  facility  is 
responsible  for  determining  the  DNA 
sequence of clones constructed at LANL,
then  returning  the  data  to  LANL  for 
analysis  and  archiving. 
Research Narratives

Lawrence  Berkeley  National  Laboratory  Human  Genome  Center 
Since 1937 the Ernest Orlando
Lawrence Berkeley
National  Laboratory 
(LBNL)  has  been  a  major 
contributor  to  knowledge 
about  human  health  effects  resulting 
from  energy  production  and  use.  That 
was  the  year  John  Lawrence  went  to 
Berkeley  to  use  his  brother  Ernest's 
cyclotrons  to  launch  the  application  of 
radioactive  isotopes  in  biological  and 
medical research. Fifty years later,
Berkeley  Lab's  Human  Genome  Center 
was  established. 

Now, after another decade, an expansion
of  biological  research  relevant  to  Hu- 
man Genome  Project  goals  is  being  car- 
ried out  within  the  Life  Sciences 
Division,  with  support  from  the  Infor- 
mation and  Computing  Sciences  and 
Engineering divisions. Individuals in
these  research  projects  are  making 
important  new  contributions  to  the 
key  fields  of  molecular,  cellular,  and 
structural  biology;  physical  chemistry; 
data  management;  and  scientific  instru- 
mentation. Additionally,  industry  in- 
volvement in  this  growing  venture  is 
stimulated  by  Berkeley  Lab's  location 
in  the  San  Francisco  Bay  area,  home  to 
the  largest  congregation  of  biotechnol- 
ogy research  facilities  in  the  world. 

In  July  1997  the  Berkeley  genome 
center became part of the Joint Genome
Institute  (see  p.  26). 

Sequencing 

Large-scale  genomic  sequencing  has 
been  a  central,  ongoing  activity  at  Ber- 
keley Lab  since  1991.  It  has  been 
funded  jointly  by  DOE  (for  human  ge- 
nome production  sequencing  and  tech- 
nology development)  and  the  NIH 
National  Human  Genome  Research  In- 
stitute [for  sequencing  the  Drosophila 
melanogaster  model  system,  which  is 
carried out in partnership with the University
of California, Berkeley (UCB)].
The  human  genome  sequencing  area  at 
Berkeley  Lab  consists  of  five  groups: 


Bioinstrumentation, Automation,
Informatics,  Biology,  and  Development. 
Complementing  these  activities  is  a 
group  in  Life  Sciences  Division  devoted 
to  functional  genomics,  including  the 
transgenics  program. 

The  directed  DNA  sequencing  strategy 
at  Berkeley  Lab  was  designed  and 
implemented  to  increase  the  efficiency 
of  genomic  sequencing  (see  figure, 
p.  45).  A  key  element  of  the  directed  ap- 
proach is  maintaining  information  about 
the  relative  positions  of  potential  se- 
quencing templates  throughout  the  entire 
sequencing  process.  Thus,  intelligent 
choices  can  be  made  about  which  tem- 
plates to  sequence,  and  the  number  of 
selected  templates  can  be  kept  to  a 
minimum.  More  important,  knowledge 
of  the  interrelationship  of  sequencing 
runs  guides  the  assembly  process,  mak- 
ing it  more  resistant  to  difficulties  im- 
posed by  repeated  sequences.  As  of 
July 3, 1997, Berkeley Lab had generated
4.4 megabases of human sequence and,
in collaboration with UCB, had tallied
7.6  megabases  of  Drosophila  sequence. 

Instrumentation  and 
Automation 

The  instrumentation  and  automation 
program  encompasses  the  design  and 
fabrication  of  custom  apparatus  to  facili- 
tate experiments,  the  programming  of 
laboratory  robots  to  automate  repetitive 
procedures,  and  the  development  of 
(1) improved hardware to extend the
applicability  range  of  existing  commer- 
cial robots  and  (2)  an  integrated  operat- 
ing system  to  control  and  monitor 
experiments.  Although  some  discrete 
instrumentation  modules  used  in  the 
integrated protocols are obtained commercially,
LBNL designs its own custom
instruments when existing capabilities are
inadequate. The instrumentation modules
are then integrated into a large system
to facilitate large-scale production
sequencing.  In  addition,  a  significant 
effort  is  devoted  to  improving 
In lieu of individual abstracts,
research  projects  and  investi- 
gators at  the  LBNL  Human 
Genome  Center  are  repre- 
sented in  this  narrative.  More 
information  can  be  found  on 
the center's Web site (see URL
above). 


Update 


In 1997 Lawrence Berkeley National
Laboratory, Lawrence
Livermore National Laboratory,
and Los Alamos National Laboratory
began collaborating in a Joint
Genome Institute to implement
high-throughput sequencing [see
p. 26 and Human Genome News
8(2), 1-2].


*Now at Amgen, Inc.
DNA Prep Machine. The DNA
Prep machine (above) was
designed by Berkeley Lab's
Martin Pollard to perform
plasmid preparation on 192
samples (2 microtiter plates)
in about 2.5 to 4 hours,
depending on the protocol.
Controlled by a personal
computer running a Visual
Basic Control program, the
instrument includes a gantry
robot equipped with pipettors,
reagent dispensers, hot and
cold temperature stations, and
a pneumatic gripper. [Source:
LBNL]


fluorescence-assay  methods,  including 
DNA  sequence  analysis  and  mass  spec- 
trometry for  molecular  sizing. 

Recent  advances  in  the  instrumentation 
group  include  DNA  Prep  machine  and 
Prep  Track.  These  instruments  are  de- 
signed to  automate  completely  the  highly 
repetitive  and  labor-intensive  DNA- 
preparation  procedure  to  provide  higher 
daily  throughput  and  DNA  of  consistent 
quality for sequencing (see photos, p. 43,
and Web pages: http://hgighub.lbl.gov/esd/DNAPrep/TitlePage.html
and http://hgighub.lbl.gov/esd/prepTrackWebpage/preptrack.htm).

Berkeley  Lab's  near-term  needs  are  for 
960  samples  per  day  of  DNA  extracted 
from overnight bacteria growths. The
DNA protocol is a modified boil prep
prepared in a 96-well format. Overnight
bacteria growths are lysed, and samples
are separated from cell debris by centrifugation.
The DNA is recovered by
ethanol  precipitation. 
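
The near-term target of 960 samples per day can be related to the DNA Prep machine's batch size. The sketch below takes the batch figures from the machine's caption (192 samples per run, read here as about 2.5 to 4 hours per run) and assumes a single machine running batches one after another with no setup time between them; those operating details are not given in the narrative. The function name `runs_needed` is illustrative.

```python
import math


def runs_needed(samples_per_day, samples_per_run):
    """Whole runs required to process the daily sample load."""
    return math.ceil(samples_per_day / samples_per_run)


runs = runs_needed(960, 192)

# Machine-hours per day at the short and long ends of the run time.
hours_low, hours_high = runs * 2.5, runs * 4.0
print(runs, hours_low, hours_high)
```

Five runs fit within a day on one machine even at the longer run time, but only just, which is consistent with the emphasis on fully automating the preparation step.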

Informatics 

The  informatics  group  is  focused  on 
hardware  and  software  support  and 
system  administration,  software 


development  for  end  sequencing, 
transposon  mapping  and  sequence  tem- 
plate selection,  data-flow  automation, 
gene finding, and sequence analysis.
Data-flow  automation  is  the  main  em- 
phasis. Six  key  steps  have  been  identi- 
fied in  this  process,  and  software  is 
being  written  and  tested  to  automate  all 
six.  The  first  step  involves  controlling 
gel  quality,  trimming  vector  sequence, 
and  storing  the  sequences  in  a  database. 
A program module called Move-Track-Trim,
which is now used in production,
was  written  to  handle  these  steps.  The 
second  through  fourth  steps  in  this  pro- 
cess involve  assembling,  editing,  and 
reconstructing P1 clones of 80,000 base
pairs  from  400-base  traces.  The  fifth 
step  is  sequence  annotation,  and  the 
sixth  is  data  submission. 
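
The first data-flow step (quality control, vector trimming, and storage) can be sketched as follows. This is not the LBNL production module, whose internals the narrative does not describe. The function names, the exact-match vector search, the minimum-length check standing in for gel quality control, and the dictionary standing in for a database are all assumptions made for illustration.

```python
def trim_vector(read, vector_end, min_length=50):
    """Remove leading vector sequence from a read.

    Looks for the known end of the cloning vector within the read and
    keeps only the insert sequence that follows it. Returns None when
    the remaining insert is shorter than min_length (read is rejected).
    """
    pos = read.find(vector_end)
    insert = read[pos + len(vector_end):] if pos != -1 else read
    return insert if len(insert) >= min_length else None


def store_reads(raw_reads, vector_end, database):
    """Trim each read and store the survivors; return the rejected names."""
    rejected = []
    for name, read in raw_reads.items():
        insert = trim_vector(read, vector_end)
        if insert is None:
            rejected.append(name)
        else:
            database[name] = insert
    return rejected


# Hypothetical vector end and reads, far shorter than real 400-base traces.
VECTOR_END = "GGATCC"
raw = {
    "read1": "TTTGGATCC" + "ACGT" * 20,   # 80-base insert after the vector
    "read2": "TTTGGATCC" + "ACGT" * 5,    # 20-base insert, too short
}
db = {}
print(store_reads(raw, VECTOR_END, db))
print(len(db["read1"]))
```

A production trimmer would tolerate sequencing errors in the vector match rather than require an exact substring; the exact match is used here only to keep the sketch short.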

Annotation  can  greatly  enhance  the  bio- 
logical value  of  these  sequences.  Useful 
annotations  include  homologies  to 
known  genes,  possible  gene  locations, 
and  gene  signals  such  as  promoters. 
LBNL  is  developing  a  workbench  for 
automatic sequence annotation and annotation
viewing and editing. The goal
is to run a series of sequence-analysis
tools  and  display  the  results  to  compare 
the  various  predictions.  Researchers 
then  will  be  able  to  examine  all  the  an- 
notations (for  example,  genes  predicted 
by  various  gene-finding  methods)  and 
select  the  ones  that  look  best. 
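
One simple way to compare predictions from several gene finders is to count how many tools support each base and report the regions where enough of them agree. The sketch below illustrates that idea only; it is not how the LBNL workbench combines results, and the tool names, coordinates, and agreement threshold are hypothetical.

```python
def support_profile(predictions, length):
    """Count, for each base, how many tools predict it as coding."""
    support = [0] * length
    for tool, intervals in predictions.items():
        covered = set()
        for start, end in intervals:          # half-open [start, end)
            covered.update(range(max(0, start), min(end, length)))
        for pos in covered:                   # each tool counts once per base
            support[pos] += 1
    return support


def consensus_regions(support, min_tools):
    """Merge consecutive bases supported by at least min_tools tools."""
    regions, start = [], None
    for pos, count in enumerate(support):
        if count >= min_tools and start is None:
            start = pos
        elif count < min_tools and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(support)))
    return regions


# Hypothetical predictions from three gene finders on a 150-base sequence.
predictions = {
    "finder_a": [(10, 50), (80, 120)],
    "finder_b": [(20, 60)],
    "finder_c": [(15, 45), (90, 110)],
}
profile = support_profile(predictions, 150)
print(consensus_regions(profile, 2))
```

Raising the threshold trades sensitivity for agreement, so a researcher could inspect the single-tool predictions separately before discarding them.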

Nomi Harris developed Genotator, an
annotation  workbench  consisting  of  a 
stand-alone  annotation  browser  and  sev- 
eral sequence-analysis  functions.  The 
back  end  runs  several  gene  finders, 
homology  searches  (using  BLAST), 
and signal searches and saves the results
in ".ace" format. Genotator thus automates
the tedious process of operating a
dozen  different  sequence-analysis  pro- 
grams with  many  different  input  and 
output  formats.  Genotator  can  function 
via  command-line  arguments  or  with 
the  graphical  user  interface  {http:// 
www'hgc.lbl.gov/mf/annotation.htm[). 
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Prep  Track.  Developed  at  the  Berkeley  Lab.  Prep  Track  is  a 

high-throughput,  microtiter-plate.  liquid-handling  rot'Otic 

system  for  automatins;  DSA  preparation  procedures. 

Microtiwr  plates  anf  fetched  from  cassettes,  moved  to  one  of 

fM'o  conveyor  belts,  and  transported  to  protocol-defined  modules. 

Plates  are  moved  continuously  and  automatically  through  the  system  as  each  module 

simultaneously  processes  plates  in  the  module  lift  statiotts.  Vne  plates  exit  the  system  and  are 

stored  in  microtiter-plate  cassetre<. 

Modules  include  a  station  capable  of  dispensing  liquid<i  in  volumes  from  as  lov.'  iv:  5  nucnyliters 
to  several  milliliters,  four  96-rhannel  pipettors.  arui  the  plate -fetching  module.  I'.ach  moflulc  is 
controlled  independently  bv  programmable  io'^ic  controllers  (PLCs).  The  overall  system  is 
controlled  by  a  personal  computer  and  a  Visuid  Hasie  Control  master  that  determines  thf  order 
in  which  plare-i  are  processed.  Tfie  actions  of  each  lift  station  and  dfspen.ser  or  pipettor  are 
determined  locally  hy  programs  resident  in  each  module  \  PLC  The  Visual  liasic  Control 
pnjgram  moves  the  plates  Inrow^h  the  system  based  on  the  predefined  protocol  and  on  module 
status  reports  av  monitowd  bv  PLCs. 

The  cunvnt  belt  length  on  the  Prep  Track  supports  eight  standard  modules,  which  can  be 
reconfigured  to  any  order  .Standardization  of  mechanical,  electrical,  and  communication 
components  allows  new  modules  to  he  designed  and  manufactured  easilv.  'The  current  standard 
module  footprint  >s  250  mm  wuie.  600  mm  deep,  and  250  mm  to  tlie  conveyor  belt  decL  The  first 
protocol  to  be  implemented  on  Prep  Track  will  be  polymerase  chain  reaction  setups,  y/ith 
sequence- reaction  setups  to  follow.  {S.mrce:  UiSL] 
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Progress  to  Date 
Chromosome  5 

Over  the  last  year,  the  center  has  focused 
its  production  genomic  sequencing  on  the 
distal  40  megabases  of  the  human  chro- 
mosome 5  long  arm.  This  region  was  cho- 
sen because  it  contains  a  cluster  of  growth 
factor  and  receptor  genes  and  is  likely  to 
yield  new  and  functionally  related  genes 
through  long-range  sequence  analysis. 
Results  to  date  include: 

•  40-megabase  nonchimeric  map  con- 
taining 82  yeast  artificial  chromosomes 
(YACs)  in  the  chromosome  5  distal 
long  arm. 

•  20-megabase  contig  map  in  the  region 
of 5q23-q33 that contains 198 P1s, 60
P1 artificial chromosomes, and 495
bacterial artificial chromosomes
(BACs) linked by 563 sequence
tagged sites (STSs) to form contigs.

•  20-megabase  bins  containing  370  BACs 
in  74  bins  in  the  region  of  5q33-q35. 

Chromosome  21 

An early project in the study of Down
syndrome (DS), which is characterized by
chromosome 21 trisomy, constructed a
high-resolution clone map in the
chromosome 21 DS region to be used as a
pilot study in generating a contiguous
gene map for all of chromosome 21. This
project has integrated P1 mapping efforts
with transgenic studies in the Life
Sciences Division. P1 maps provide a
suitable form of genomic DNA for
isolating and mapping cDNA.

• 186 clones isolated in the major DS
region of chromosome 21 comprising
about 3 megabases of genomic DNA
extending from D21S17 to ETS2.
Through cross-hybridization,
overlapping P1s were identified, as well
as gaps between two P1 contigs, and
transgenic mice were created from P1
clones in the DS region for use in
phenotypic studies.


Transgenic  Mice 

One  of  the  approaches  for  determining 
the  biological  function  of  newly  identi- 
fied genes  uses  YAC  transgenic  mice. 
Human  sequence  harbored  by  YACs  in 
transgenic  mice  has  been  shown  to  be 
correctly  regulated  both  temporally  and 
spatially.  A  set  of  nonchimeric  overlap- 
ping YACs  identified  from  the  5q31  re- 
gion has  been  used  to  create  transgenic 
mice.  This  set  of  transgenic  mice,  which 
together  harbor  1.5  megabases  of  hu- 
man sequence,  will  be  used  to  assess  the 
expression  pattern  and  potential  func- 
tion of  putative  genes  discovered  in  the 
5q31 region. Additional mapping and
sequencing  are  under  way  in  a  region  of 
human  chromosome  20  amplified  in 
certain  breast  tumor  cell  lines. 

Resource  for  Molecular 
Cytogenetics 

Divining  landmarks  for  human  disease 
amid  the  enormous  plain  of  the  human 
genetic  map  is  the  mission  of  an  ambi- 
tious partnership  among  the  Berkeley 
Lab;  University  of  California,  San  Fran- 
cisco; and  a  diagnostics  company.  The 
collaborative  Resource  for  Molecular 
Cytogenetics  is  charting  a  course  toward 
important  sites  of  biological  interest  on 
the  23  pairs  of  human  chromosomes 
(http://rmc-www.lbl.gov). 

The  Resource  employs  the  many  tools 
of  molecular  cytogenetics.  The  most 
basic  of  these  tools,  and  the  cornerstone 
of  the  Resource's  portfolio  of  proprietary 
technology, is a method generally known
as "chromosome painting," which uses
a  technique  referred  to  as  fluorescence 
in  situ  hybridization  or  FISH.  This  tech- 
nology was  invented  by  LBNL  Re- 
source leaders  Joe  Gray  and  Dan  Pinkel. 

A technology to emerge recently from
the Resource is known as "Quantitative
DNA Fiber Mapping (QDFM)." High-
resolution human genome maps in a
form suitable for DNA sequencing
traditionally have been constructed by
various methods of fingerprinting,
hybridization, and identification of
overlapping STSs. However, these
techniques do not readily yield
information about sequence orientation,
the extent of overlap of these elements,
or the size of gaps in the map. Ulli Weier
of the Resource developed the QDFM
method of physical map assembly that
enables the mapping of cloned DNA
directly onto linear, fully extended DNA
molecules. QDFM allows unambiguous
assembly of critical elements leading to
high-resolution physical maps. This task
now can be accomplished in less than
2 days, as compared with weeks by
conventional methods. QDFM also
enables detection and characterization of
gaps in existing physical maps, a crucial
step toward completing a definitive
human genome map.

[Figure: Sequencing Strategy. Panels, top to bottom: genomic region (0 to 300 kb) with markers
sts 1 to sts 4 and physical mapping clones; a single mapping clone (0 to 80 kb) with its minimal
spanning set of 3-kb subclones, obtained by (1) shearing and subcloning the physical mapping
clone and (2) generating a spanning set using end sequencing; mapped transposon insertions in
each 3-kb subclone of the spanning set (subset to be sequenced shown in solid color); two
sequencing runs from each selected transposon (-400 to +400 bp).]

Sequencing Strategy. The directed sequencing strategy used at LBNL involves four steps: (1) generate a
P1-based physical map (using STS-content mapping) to provide a set of minimally overlapping clones,
(2) shear and subclone each P1 clone into 3-kilobase fragments and identify a minimally overlapping
subclone set, (3) generate and map transposon inserts in each subclone, and (4) sequence using
commercial primer-binding sites engineered into the transposon. Subclone sequences are then assembled
and edited, and the gaps are identified. P1 clones are reconstructed, and the resulting composite data is
analyzed, annotated, and finally submitted to the databases. The production sequencing effort has
generated 12 megabases of finished, double-stranded genomic DNA sequence from both Drosophila
and human templates. [Source: Adapted from figure provided by LBNL]
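
Selecting a minimally overlapping set of clones or subclones, as in steps 1 and 2 of the directed sequencing strategy, is a classic interval-cover problem. The sketch below uses the standard greedy algorithm and assumes clone start and end coordinates are already known from mapping. It is an illustration, not the procedure LBNL used, and the clone names and coordinates are invented.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in base pairs, half-open


def minimal_spanning_set(clones: List[Clone],
                         region_start: int,
                         region_end: int) -> List[str]:
    """Greedy interval cover: repeatedly pick the clone that starts at or
    before the covered position and extends furthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    chosen, covered, i = [], region_start, 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in coverage at position {covered}")
        chosen.append(best[0])
        covered = best[2]
    return chosen


# Five invented 3-kb subclones tiling an 8-kb stretch.
subclones = [("a", 0, 3000), ("b", 1000, 4000), ("c", 2500, 5500),
             ("d", 3500, 6500), ("e", 5000, 8000)]
spanning = minimal_spanning_set(subclones, 0, 8000)
```

For intervals on a line the greedy choice yields a cover with the fewest clones, and a raised error corresponds to a gap that must be closed before sequencing can proceed.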

Research Narratives
University  of  Washington  Genome  Center 


The Human Genome Project soon will
need to increase rapidly the scale at
which human DNA is analyzed. The
ultimate goal is to determine the order
of the 3 billion bases that encode all
heritable information. During the 20
years since effective methods were
introduced to carry out DNA sequencing
by biochemical analysis of
recombinant-DNA molecules, these
techniques have improved dramatically.
In the late 1970s, segments of DNA
spanning a few thousand bases
challenged the capacity of world-class
sequencing laboratories. Now, a few
million base pairs per year represent
state-of-the-art output for a single
sequencing center.

However, the Human Genome Project is
directed toward completing the human
sequence in 5 to 10 years, so the data
must be acquired with technology
available now. This goal, while clearly
feasible, poses substantial organizational
and technical challenges.
Organizationally, genome centers must
begin building data-production units
capable of sustained, cost-effective
operation. Technically, many incremental
refinements of current technology must
be introduced, particularly those that
remove impediments to increasing the
scale of DNA sequencing. The University
of Washington (UW) Genome Center is
active in both areas.


Production Sequencing

Both to gain experience in the production
of high-quality, low-cost DNA sequence
and to generate data of immediate
biological interest, the center is
sequencing several regions of human and
mouse DNA at a current throughput of
2 million bases per year. This
"production sequencing" has three major
targets: the human leukocyte antigen
(HLA) locus on human chromosome 6,
the mouse locus encoding the alpha
subunit of T-cell receptors, and an
"anonymous" region of human
chromosome 7.

The HLA locus encodes genes that must
be closely matched between organ donors
and organ recipients. This sequence data
is expected to lead to long-term
improvements in the ability to achieve
good matches between unrelated organ
donors and recipients.

The mouse locus that encodes
components of the T-cell-receptor family
is of interest for several reasons. The
locus specifies a set of proteins that play
a critical role in cell-mediated immune
responses. It provides sequence data that
will help in the design of new
experimental approaches to the study of
immunity in mice, one of the most
important experimental animals for
immunological research. In addition, the
locus will provide one of the first large
blocks of DNA sequence for which both
human and mouse versions are known.

Human-mouse sequence comparisons
provide a powerful means of identifying
the most important biological features of
DNA sequence because these features are
often highly conserved, even between
such biologically different organisms as
human and mouse. Finally, sequencing
an "anonymous" region of human
chromosome 7, a region about which
little was known previously, provides
experience in carrying out large-scale
sequencing under the conditions that will
prevail throughout most of the Human
Genome Project.


University of Washington
Genome Center
Department of Medicine
Box 352145
Seattle, WA 98195

Maynard Olson
Director

M6/ASS-7.Ui6, Fax: -7344
mvo@u.washington.edu

For more information on
research projects and investi-
gators at the University of
Washington Genome Center,
see abstracts in Part 2 of this
report and the center's Web
site (see URL above).


Technology  for  Large- 
Scale  Sequencing 


In addition to these pilot projects, the
UW Genome Center is developing
incremental improvements in current
sequencing technology. A particular
focus is on enhanced computer software
to process raw data acquired with
automated laboratory instruments that
are used in DNA mapping and
sequencing. Advanced instrumentation
is commercially available for
determining DNA sequence via the
"four-color-fluorescence method," and
this instrumentation is expected to carry
the main experimental load of the Human
Genome Project. Raw data produced by
these instruments, however, require
extensive processing before they are
ready for biological analysis.

Large-scale  sequencing  involves  a  "divide- 
and-conquer"  strategy  in  which  the  huge 
DNA  molecules  present  in  human  cells 
are  broken  into  smaller  pieces  that  can  be 
propagated by recombinant-DNA
methods.  Individual  analyses  ultimately 
are  carried  out  on  segments  of  less  than 
1,000 bases. Many such analyses, each of
which  still  contains  numerous  errors,  must 
be  melded  together  to  obtain  finished  se- 
quence. During  the  melding,  errors  in  in- 
dividual analyses  must  be  recognized  and 
corrected.  In  typical  large-scale  sequenc- 
ing projects,  the  results  of  thousands  of 
analyses  are  melded  to  produce  highly 
accurate  sequence  (less  than  one  error  in 
10,000  bases)  that  is  continuous  in 
blocks  of  100,000  or  more  bases.  The 
UW  Genome  Center  is  playing  a  major 
role  in  developing  software  that  allows  this 
process  to  be  carried  out  automatically 
with  little  need  for  expert  intervention. 
Software  developed  in  the  UW  center  is 
used  in  more  than  50  sequencing  laborato- 
ries around  the  world,  including  most  of 
the large-scale sequencing centers produc-
ing data for the Human Genome Project.
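
The "melding" step, in which errors in individual reads are outvoted by overlapping reads, can be illustrated with a column-wise majority vote. This is a deliberately simplified sketch and not the UW software. It assumes the reads are already aligned and padded with "-" to equal length, and it ignores the base-quality information a real assembler would use.

```python
from collections import Counter
from typing import List


def consensus(aligned_reads: List[str], gap: str = "-") -> str:
    """Column-wise majority vote over pre-aligned, equal-length reads.

    The gap character marks positions a read does not cover. Ties are
    broken alphabetically, an arbitrary choice made for this sketch.
    """
    if len({len(r) for r in aligned_reads}) > 1:
        raise ValueError("reads must be padded to equal length")
    out = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if not counts:
            out.append("N")  # no read covers this position
            continue
        top = max(counts.values())
        out.append(min(b for b, n in counts.items() if n == top))
    return "".join(out)


reads = [
    "ACGTAC----",
    "-CGTTCGA--",  # carries one miscalled base (T at column 4)
    "--GTACGATT",
]
merged = consensus(reads)  # "ACGTACGATT"
```

The miscalled base in the second read is corrected because two other reads cover the same column, which is why deeper coverage drives the error rate of the finished sequence down.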

High-Resolution 
Physical  Mapping 

The  UW  Genome  Center  also  is  develop- 
ing improved  software  that  addresses  a 
higher-level  problem  in  large-scale  se- 
quencing. The  starting  point  for  large-scale 
sequencing  typically  is  a  recombinant- 
DNA  molecule  that  allows  propagation 
of  a  particular  human  genomic  segment 
spanning  50,000  to  200,000  bases. 
Much  effort  during  the  last  decade  has 
gone  into  the  physical  mapping  of  such 
molecules,  a  process  that  allows  huge 
regions of chromosomes to be defined


in  terms  of  sets  of  overlapping 
recombinant-DNA  molecules  whose 
precise  positions  along  the  chromosome 
are  known.  However,  the  precision  re- 
quired for  knowing  relationships  of 
recombinant-DNA  molecules  derived 
from  neighboring  chromosomal  por- 
tions increases  as  the  Human  Genome 
Project  shifts  its  emphasis  from  map- 
ping to  sequencing. 

High-resolution  maps  both  guide  the  or- 
derly sequencing  of  chromosomes  and 
play  a  critical  role  in  quality  control. 
Only  by  mapping  recombinant-DNA 
molecules  at  high  resolution  can  subtle 
defects  in  particular  molecules  be  rec- 
ognized. Such  defective  human  DNA 
sources,  which  are  not  faithful  replicas 
of  the  human  genome,  must  be  weeded 
out  before  sequencing  can  begin.  The 
UW  Genome  Center  has  a  major  program 
in  high-resolution  physical  mapping 
which,  like  the  work  on  sequencing  it- 
self, uses  advanced  computing  tools. 
The  center  is  producing  maps  of  regions 
targeted  for  sequencing  on  a  just-in- 
time  basis.  These  highly  detailed  maps 
are  proving  extremely  valuable  in  fa- 
cilitating the  production  of  high-quality 
sequence. 

Ultimate  Goal 

Although  many  challenges  currently 
posed  by  the  Human  Genome  Project 
are  highly  technical,  the  ultimate  goal  is 
biological.  The  project  will  deliver 
immense  amounts  of  high-quality, 
continuous  DNA  sequence  into  pub- 
licly accessible  databases.  These  data 
will  be  annotated  so  that  biologists  who 
use  them  will  know  the  most  likely 
positions  of  genes  and  have  convenient 
access  to  the  best  available  clues  about 
the  probable  function  of  these  genes. 
The  better  the  technical  solutions  to  cur- 
rent challenges,  the  better  the  center 
will  be  able  to  serve  future  users  of  the 
human genome sequence.

Research Narratives

Genome  Database 


The  release  of  Version  6  of  the 
Genome  Database  (GDB)  in 
January  1996  signaled  a  ma- 
jor change  for  both  the  scien- 
tific community  and  GDB 
staff.  GDB  6.0  introduced  a  number  of 
significant  improvements  over  previous 
versions  of  GDB,  most  notably  a  revised 
data  representation  for  genes  and  ge- 
nomic maps  and  a  new  curatorial  model 
for  the  database.  These  new  features, 
along  with  a  remodeled  database  structure 
and new schema and user interface, pro-
vide a resource with the potential to inte-
grate all scientific information currently
available on human genomics. GDB rap-
idly is becoming the international biomedi-
cal research community's central source
for information about genomic structure,
content, diversity, and evolution.

A  New  Data  Model 

Inherent  in  the  underlying  organization  of 
information  in  GDB  is  an  improved 
model  for  genes,  maps,  and  other  classes 
of  data.  In  particular,  genomic  segments 
(any  named  region  of  the  genome)  and 
maps  are  being  expanded  regularly.  New 
segment types have been added to support
the integration of mapping and sequencing
data (for example, gene elements and re-
peats) and the construction of comparative
maps (syntenic regions). New map types
include  comparative  maps  for  represent- 
ing conserved  syntenies  between  species 
and  comprehensive  maps  that  combine 
data  from  all  the  various  submitted  maps 
within  GDB  to  provide  a  single  integrated 
view  of  the  genome.  Experimental  obser- 
vations such  as  order,  size,  distance,  and 
chimerism  are  also  available. 

Through the World Wide Web, GDB links
its  stored  data  with  many  other  biological 
resources  on  the  Internet.  GDB's  External 
Link  category  is  a  growing  collection  of 
cross-references  established  between 
GDB  entities  and  related  information  in 
other  databases.  By  providing  a  place  for 
these  cross-references,  GDB  can  serve  as 
a  central  point  of  inquiry  into  technical 
data  regarding  human  genomics. 


Direct Community
Data  Submission  and 
Curation 

Two  methods  for  data  submission  are  in 
use.  For  individuals  submitting  small 
amounts  of  data,  interactive  editing  of 
the  database  through  the  Web  became 
available in April 1996, and the process
has  undergone  several  simplifications 
since  that  time.  This  continues  to  be  an 
area  of  development  for  GDB  because 
all  editing  must  take  place  at  the  Balti- 
more site,  and  Internet  connections 
from  outside  North  America  may  be  too 
slow  for  interactive  editing  to  be  practi- 
cal. Until  these  difficulties  are  resolved, 
GDB encourages scientists with limited
connectivity to Baltimore to submit
their data via more traditional means
(e-mail,  fax,  mail,  phone)  or  to  prepare 
electronic  submissions  for  entry  by  the 
data  group  on  site. 

For  centers  submitting  large  quantities 
of  data,  GDB  developed  an  electronic 
data  submission  (EDS)  tool,  which  pro- 
vides the  means  to  specify  login  pass- 
word validation  and  commands  for 
inserting  and  updating  data  in  GDB. 
The  EDS  syntax  includes  a  mechanism 
for  relating  a  center's  local  naming  con- 
ventions to  GDB  objects.  Data  submit- 
ted to  GDB  may  be  stored  privately  for 
up  to  6  months  before  it  automatically 
becomes public. The database is pro-
grammed to enforce this Human Genome
Project policy. Detailed specifications
of GDB's EDS syntax and other sub-
mission instructions are available (EDS
prototype, http://www.gdb.org/eds).
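
The rule that privately held data becomes public automatically after at most 6 months can be expressed as a small date calculation. The sketch below is only an illustration of such a policy check. The function names are hypothetical, GDB's actual enforcement logic is not described in this report, and a submitter may of course release data sooner than the maximum modeled here.

```python
import calendar
from datetime import date


def public_release_date(submitted: date, hold_months: int = 6) -> date:
    """Latest date on which a privately held submission becomes public.

    Adds calendar months and clamps the day to the end of the target
    month (for example, 31 August plus 6 months gives the last day of
    February).
    """
    month_index = submitted.month - 1 + hold_months
    year = submitted.year + month_index // 12
    month = month_index % 12 + 1
    day = min(submitted.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_public(submitted: date, today: date, hold_months: int = 6) -> bool:
    """True once the maximum hold period has elapsed."""
    return today >= public_release_date(submitted, hold_months)
```

A database job could evaluate `is_public` for each private record on a schedule and flip its visibility once the check passes.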

Since  the  EDS  system  was  imple- 
mented, GDB  has  put  forth  an  aggres- 
sive effort  to  increase  the  amount  of 
data stored in the database. Conse-
quently, the database has grown tremen-
dously. During 1996 it grew from 1.8 to
6.7 gigabytes.

To provide accountability regarding data
quality, the shift to community curation
introduced the idea that individuals and
laboratories own the data they submit to
GDB and that other researchers cannot
modify it. However, others should be
able to add information and comments,
so an additional feature is the
community's ability to conduct electronic
online public discussions by annotating
the database submissions of fellow
researchers. GDB is the first database of
its kind to offer this feature, and the
number of third-party annotations is
increasing in the form of editorial
commentary, links to literature citations,
and links to other databases external to
GDB. These links are an important part
of the curatorial process because they
make other data collections available to
GDB users in an appropriate context.


Genome Database
Johns Hopkins University
2024 E. Monument Street
Baltimore, MD 21205-223^

Stanley Letovsky
Informatics Director

Robert Cottingham
Operations Director

Telephone for both: 410/955-9705
Fax for both: 410/614-0434

David Kingsbury
Director, 1993-97*


In lieu of individual abstracts,
research projects and investi-
gators at GDB are represented
in this narrative. More infor-
mation can be found on GDB's
Web site (see URL above).


*Now at Chiron Pharmaceuticals,
Emeryville, California
Improved  Map 
Representation 
and  Querying 

Accompanying  the  release  of  GDB  6.0, 
the  program  Mapview  creates  graphical 
displays  of  maps.  Mapview  was  devel- 
oped at  GDB  to  display  a  number  of 
map  types  (cytogenetic,  radiation  hybrid, 
contig,  and  linkage)  using  common 
graphical  conventions  found  in  the  lit- 
erature. Mapview  is  designed  to  stand 
alone  or  to  be  used  in  conjunction  with 
a  Web  browser  such  as  Netscape,  thereby 
creating  an  interactive  graphical  display 
system.  When  used  with  Netscape, 
Mapview  allows  the  user  to  retrieve  de- 
tails about  any  displayed  map  object. 

Maps  are  accessed  through  the  query 
form  for  genomic  segment  and  its  sub- 
classes via  a  special  program  that  al- 
lows the  user  to  select  whole  maps  or 
slices  of  maps  from  specific  regions  of 
interest  and  to  query  by  map  type.  The 
ability  to  browse  maps  stored  in  GDB 
or  download  them  in  the  background 
was  also  incorporated  into  GDB  6.0. 

GDB  stores  many  maps  of  each  chro- 
mosome, generated  by  a  variety  of  map- 
ping methods.  Users  who  are  interested 


in  a  region,  such  as  the  neighborhood  of 
a  gene  or  marker,  will  be  able  to  see  all 
maps  that  have  data  in  that  region, 
whether  or  not  they  contain  the  desired 
marker.  To  support  database  querying 
by  region  of  interest,  integrated  maps 
have  been  developed  that  combine  data 
from  all  the  maps  for  each  chromosome. 
These  are  called  Comprehensive  Maps. 

Queries  for  all  loci  in  a  region  of  inter- 
est are  processed  against  the  compre- 
hensive maps,  thereby  searching  all 
relevant  maps.  Comprehensive  maps  are 
also  useful  for  display  purposes  because 
they  organize  the  content  of  a  region  by 
class  of  locus  (e.g.,  gene,  marker,  clone) 
rather  than  by  data  source.  This  approach 
yields  a  much  less  complex  presentation 
than  an  alignment  of  numerous  primary 
maps.  Because  such  information  as  de- 
tailed orders,  order  discrepancies  be- 
tween maps,  and  nonlinear  metric 
relations  between  maps  is  not  always 
captured  in  the  comprehensive  maps, 
GDB  continues  to  provide  access  to 
aligned  displays  of  primary  maps. 
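
A region-of-interest query against a comprehensive map can be sketched as a lookup between two flanking markers, with the results grouped by class of locus rather than by data source. The data structure, locus names, and coordinates below are invented for illustration and do not reflect GDB's actual schema.

```python
from typing import Dict, List, NamedTuple


class Locus(NamedTuple):
    name: str
    kind: str        # e.g. "gene", "marker", "clone"
    position: float  # coordinate on the integrated map (arbitrary units)


def loci_in_region(comprehensive_map: List[Locus],
                   left: str, right: str) -> Dict[str, List[str]]:
    """Return loci lying between two flanking markers, grouped by class."""
    positions = {locus.name: locus.position for locus in comprehensive_map}
    lo, hi = sorted((positions[left], positions[right]))
    grouped: Dict[str, List[str]] = {}
    for locus in sorted(comprehensive_map, key=lambda l: l.position):
        if lo <= locus.position <= hi:
            grouped.setdefault(locus.kind, []).append(locus.name)
    return grouped


chr_map = [
    Locus("MARKER_A", "marker", 10.0),
    Locus("GENE_X", "gene", 12.5),
    Locus("CLONE_1", "clone", 14.0),
    Locus("MARKER_B", "marker", 20.0),
    Locus("GENE_Y", "gene", 31.0),
]
hits = loci_in_region(chr_map, "MARKER_B", "MARKER_A")
```

Because the query runs against one integrated coordinate system, a locus is returned even if it was never placed on the same primary map as the flanking markers.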

A  Variety  of  Searching 
Strategies 

Recognizing  the  eclectic  user  commu- 
nity's need  to  search  data  and  formulate 
queries,  GDB  offers  a  spectrum  of 
simple  to  complex  search  strategies.  In 
addition,  direct  programming  access  is 
available  using  either  GDB's  object 
query  language  to  the  Object  Broker 
software  layer  or  standard  query  lan- 
guage to  the  underlying  Sybase  rela- 
tional database. 

Querying  by  Object  Directly 
from  GDB's  Home  Page 

The  simplest  methods  search  for  objects 
according  to  known  GDB  accession 
numbers;  sequence  database-accession 
numbers;  specified  names,  including 
wildcard symbols that will automatically
match synonyms and primary names; and
keywords contained anywhere in the text.

Querying by Region of Interest

A region of interest can be specified
using a pair of flanking markers, which
can be cytogenetic bands, genes,
amplimers (sequence tagged sites), or
any other mapped objects. Given a
region of interest, the comprehensive
maps are searched to find all loci that
fall within it. These loci can be
displayed in a table, graphically as a slice
through a comprehensive map, or as
slices through a chosen set of primary
maps. A comprehensive map slice
shows all loci in the region, including
genes, expressed sequence tags (ESTs),
amplimers, and clones. A region also
can be specified as a neighborhood
around a single marker of interest.

Results of queries for genes, amplimers,
ESTs, or clones can be displayed on a
GDB comprehensive map. Results are
spread across several chromosomes
displayed in Mapview (see figure, p. 52).
A query for all the PAX genes (specified
as symbol = PAX* on the gene query
form) retrieves genes on multiple
chromosomes. Double-clicking on one of
these genes brings up detailed gene
information via the Web browser.

Querying by Polymorphism

GDB contains a large number of
polymorphisms associated with genes and
other markers. Queries can be
constructed for a particular type of marker
(e.g., gene, amplimer, clone),
polymorphism (e.g., dinucleotide repeat),
or level of heterozygosity. These queries
can be combined with positional queries
to find, for example, polymorphic
amplimers in a region bounded by
flanking markers or in a particular
chromosomal band. If desired, the
retrieved markers can be viewed on a
comprehensive map.


Work in Progress

Mapview 2.3

Mapview 2.1, the next generation of the
GDB  map  viewer,  was  released  in 
March  1997.  The  latest  version, 
Mapview  2.3,  is  available  in  all  com- 
mon computing  environments  because 
it  is  written  in  the  Java  programming 
language.  Most  important,  the  new 
viewer  can  display  multiple  aligned 
maps  side  by  side  in  the  window,  with 
alignment  lines  indicating  common 
markers  in  neighboring  maps.  As  be- 
fore, users  can  select  individual  markers 
to  retrieve  more  information  about  them 
from  the  database. 

GDB  developers  have  entered  into  a 
collaborative  relationship  with  other 
members of the bioWidget Consortium
so the Java-based alignment viewer will
become part of a collection of freely
available software tools for displaying
biological data (http://goodman.jax.org/
projects/biowidgets/consortium).

Future  plans  for  Mapview  include  pro- 
viding or  enhancing  the  ability  to  gener- 
ate manuscript-ready  Postscript  map 
images,  highlight  or  modify  the  display 
of  particular  classes  of  map  objects 
based  on  attribute  values,  and  requery 
for  additional  information. 

Variation 

Since its inception, GDB has been a
repository for polymorphism data, with
more than 18,000 polymorphisms now
in GDB. A collaboration has been
initiated with the Human Gene Mutation
Database (HGMD) based in Cardiff,
Wales, and headed by David Cooper
and Michael Krawczak. HGMD's
extensive collection of human mutation
data, covering many disease-causing
loci, includes sequence-level mutation
characterizations. This data set will be
included in GDB and updated from
HGMD on an ongoing basis. The
HGMD team also will provide advice
on GDB's representation of genetic
variation, which is being enhanced to
model mutations and polymorphisms at
the sequence level. These modifications
will allow GDB to act as a repository
for single-nucleotide polymorphisms,
which are expected to be a major source
of information on human genetic
variation in the near future.


Mouse  Synteny 

Genomic relationships between mouse
and man provide important clues regard-
ing gene location, phenotype, and func-
tion (see figure, p. 53). One of GDB's
goals is to enable direct comparisons be-
tween these two organisms, in collabora-
tion with the Mouse Genome Database
at Jackson Laboratory. GDB is making
additions  to  its  schema  to  represent  this 
information  so  that  it  can  be  displayed 
graphically  with  Mapview.  In  addition, 
algorithmic  work  is  under  way  to  use 
mapping  data  to  automatically  identify 
regions  of  conserved  synteny  between 
mouse  and  man.  These  algorithms  will 
allow  the  synteny  maps  to  be  updated 
regularly.  An  important  application  of 
comparative  mapping  is  the  ability  to 
predict  the  existence  and  location  of  un- 
known human  homologs  of  known, 
mapped  mouse  genes.  A  set  of  such  pre- 
dictions is  available  in  a  report  at  the 
GDB  Web  site,  and  similar  data  will  be 
available  in  the  database  itself  in  the 
spring  of  1998. 
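
One simple way to identify candidate regions of conserved synteny from mapping data is to walk along each mouse chromosome and group neighboring genes whose human homologs lie on the same human chromosome. The sketch below shows only that idea. It is not the algorithm under development at GDB, and the gene names, positions, and chromosome assignments are invented toy values.

```python
from itertools import groupby
from typing import List, Tuple

# (gene, mouse_chromosome, mouse_position, human_chromosome); toy values
MappedGene = Tuple[str, str, float, str]


def synteny_blocks(genes: List[MappedGene], min_genes: int = 2):
    """Group neighboring mouse genes whose human homologs share a chromosome."""
    blocks = []
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for (mouse_chr, human_chr), run in groupby(ordered,
                                               key=lambda g: (g[1], g[3])):
        names = [g[0] for g in run]
        if len(names) >= min_genes:
            blocks.append((mouse_chr, human_chr, names))
    return blocks


genes = [
    ("g1", "11", 5.0, "17"), ("g2", "11", 9.0, "17"),
    ("g3", "11", 12.0, "5"),
    ("g4", "11", 20.0, "17"), ("g5", "11", 22.0, "17"),
    ("g6", "2", 3.0, "20"),
]
blocks = synteny_blocks(genes)
```

A mapped mouse gene that falls inside such a block but has no known human homolog is then a natural candidate for a predicted human gene on the block's human chromosome.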

Collaborations 

GDB  is  a  participant  in  the  Genome 
Annotation  Consortium  (GAC)  project, 
whose  goal  is  to  produce  high-quality, 
automatic annotation of genomic se-
quences (http://compbio.ornl.gov/
CoLab). Currently, GDB is developing
a  prototype  mechanism  to  transition 
from  GDB's  Mapview  display  to  the 
GAC  sequence-level  browser  over 
common  genome  regions.  GAC  also 
will  establish  a  human  genome  refer- 
ence sequence  that  will  be  the  base 
against  which  GDB  will  refer  all  poly- 
morphisms and  mutations.  Ultimately, 
every  genomic  object  in  GDB  should  be 
related  to  an  appropriate  region  of  the 
reference  sequence. 

Sequencing  Progress 

The  sequencing  status  of  genomic  re- 
gions now  can  be  recorded  in  GDB. 


Based  on  submissions  to  sequence  data- 
bases, GAC  will  determine  genomic  re- 
gions that  have  been  completed.  GDB 
also  will  be  collaborating  with  the  Euro- 
pean Bioinformatics  Institute,  in  con- 
junction with  the  international  Human 
Genome  Organisation  (HUGO),  to 
maintain  a  single  shared  Human  Se- 
quence Index  that  will  record  commit- 
ments and  status  for  sequencing  clones 
or  regions.  As  a  result,  the  sequencing 
status  of  any  region  can  be  displayed 
alongside  other  GDB  mapping  data. 

Outreach 

The  Genome  Database  continues  to 
seek  direct  community  feedback  and  in- 
teract with  the  broader  science  commu- 
nity via  various  sources: 

•  International  Scientific  Advisory 
Committee  meets  annually  to  offer 
input  and  advice. 

•  Quarterly  Review  Committee  confers 
frequently  with  the  staff  to  track 
GDB  progress  and  suggest  change. 

•  HUGO  nomenclature,  chromosome, 
and  other  editorial  committees  have 
specialized  functions  within  GDB, 
providing  official  names  and  consen- 
sus maps  and  ensuring  the  high  qual- 
ity of  GDB's  content. 

Copies  of  GDB  are  available  worldwide 
from ten mirror sites (nodes) that make
the  data  more  easily  accessible  to  the  in- 
ternational research  community.  GDB 
staff  meet  annually  with  node  managers 
to  facilitate  interaction  and  to  benefit 
from other user perspectives.

Research Narratives
National  Center  for  Genome  Resources 


The  National  Center  for 
Genome  Resources 
(NCGR)  is  a  not-for- 
profit  organization  cre- 
ated to  design,  develop, 
support,  and  deliver  resources  in  sup- 
port of  public  and  private  genome  and 
genetic  research.  To  accomplish  these 
goals,  NCGR  is  developing  and  publish- 
ing the  Genome  Sequence  DataBase 
(GSDB)  and  the  Genetics  and  Public 
Issues (GPI) program.

NCGR  is  a  center  to  facilitate  the  flow 
of information and resources from genome
projects into both public and private
sectors. A broadly based board of
governors  provides  direction  and  strat- 
egy for  the  center's  development. 

NCGR opened in Santa Fe in July 1994,
with its initial bioinformatics work
being  developed  through  a  coopera- 
tive 5-year  agreement  with  the  Depart- 
ment of  Energy  funded  in  July  1995. 
Committed  to  serving  as  a  resource  for 
all  genomic  research,  the  center 
works  collaboratively  with  researchers 
and  seeks  input  from  users  to  ensure 
that  tools  and  projects  under  develop- 
ment meet  their  needs. 

Genome  Sequence 
DataBase 

GSDB  is  a  relational  database  that  con- 
tains nucleotide  sequence  data  (see  pie 
chart)  and  its  associated  annotation 
from  all  known  organisms  (http:// 
www.ncgr.org/gsdb).  All  data  are  freely 
available  to  the  public.  The  major  goals 
of  GSDB  are  to  provide  the  support 
structure  for  storing  sequence  data  and 
to furnish useful data-retrieval services.

GSDB  adheres  to  the  philosophy  that 
the  database  is  a  "community-owned" 
resource  that  should  be  simple  to  update 
to  reflect  new  discoveries  about  se- 
quences. A  corollary  to  this  is  GSDB's 
conviction  that  researchers  know  their 
areas  of  expertise  much  better  than  a 
database curator and, therefore, they


should  be  given  ownership  and  control 
over  the  data  they  submit  to  the  data- 
base. The  true  role  of  the  GSDB  staff  is 
to  help  researchers  submit  data  to  and 
retrieve  data  from  the  database. 

GSDB  Enhancements 

During  1996,  GSDB  underwent  a  major 
renovation  to  support  new  data  types 
and  concepts  that  are  important  to  ge- 
nomic research.  Tables  within  the  data- 
base were  restructured,  and  new  tables 
and  data  fields  were  added.  Some  key 
additions to GSDB include the support
of  data  ownership,  sequence  align- 
ments, and  discontiguous  sequences. 

The  concept  of  data  ownership  is  a  cor- 
nerstone to  the  functioning  of  the  new 
GSDB. Every piece of data (e.g., sequence
or feature) within the database is
owned  by  the  submitting  researcher,  and 
changes  can  be  made  only  by  the  data 
owner  or  GSDB  staff.  This  implementa- 
tion of  data  ownership  provides  GSDB 
with  the  ability  to  support  community 
(third-party) annotation: the addition
of annotation to a sequence by other
community researchers.
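
The ownership rule described above can be sketched in a few lines. This is an illustrative model only, not GSDB's actual schema or software. It assumes that every sequence and every annotation records its submitter, that only the owner or staff may change a piece of data, and that any researcher may attach a new annotation, which that researcher then owns. All names, including the staff account, are hypothetical.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when a user tries to change data owned by someone else."""


@dataclass
class Annotation:
    owner: str   # researcher who submitted this annotation
    text: str


@dataclass
class SequenceRecord:
    owner: str   # researcher who submitted the sequence
    residues: str
    annotations: list = field(default_factory=list)


STAFF = frozenset({"gsdb_staff"})  # hypothetical staff account name


def _require_owner(user, owner):
    # Only the data owner or database staff may change a piece of data.
    if user != owner and user not in STAFF:
        raise PermissionDenied(f"{user} does not own this data")


def update_residues(record, user, residues):
    _require_owner(user, record.owner)
    record.residues = residues


def add_annotation(record, user, text):
    # Third-party annotation: anyone may add a note, and the author owns it.
    note = Annotation(owner=user, text=text)
    record.annotations.append(note)
    return note


def edit_annotation(note, user, text):
    _require_owner(user, note.owner)
    note.text = text
```

Keeping the owner on each piece of data, rather than on the record as a whole, is what lets a third party annotate a sequence without gaining the right to alter the sequence itself.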


Genome Sequence DataBase
1800 Old Pecos Trail, Suite A

Peter Schad
Vice-President, Bioinformatics
and Biotechnology
505/995-4447, Fax: -4432
cnc@ncgr.org

Carol Harger
GSDB Manager
505/982-7840, Fax: -7690

In lieu of individual abstracts,
research  projects  and  investi- 
gators at  NCGR  are  repre- 
sented in  this  narrative.  More 
information  can  be  found  on 
the  center's  Web  site  (see  URL 
above).

A second enhancement of GSDB is the
ability  to  store  and  represent  sequence 
alignments.  GSDB  staff  has  been  con- 
structing alignments  to  several  key  se- 
quences including  the  env  and  pol 
(reverse  transcriptase)  genes  of  the  HIV 
genome,  the  complete  chromosome  VIII 
of  Saccharomyces  cerevisiae,  and  the 
complete  genome  of  Haemophilus 
influenzae.  These  alignments  are  useful 
as  possible  sites  of  biological  interest  and 
for  rapidly  identifying  differences  be- 
tween sequences. 

A  third  key  GSDB  enhancement  is  the 
ability  to  represent  known  relationships 
of  order  and  distance  between  separate 
individual  pieces  of  sequence.  These 
sets  of  sequences  and  their  relative  posi- 
tions are  grouped  together  as  a  single 
discontiguous  sequence.  Such  a  sequence 
may  be  as  simple  as  two  primers  that  de- 
fine the  ends  of  a  sequence  tagged  site 
(STS),  it  may  comprise  all  exons  that  are 
part  of  a  single  gene,  or  it  may  be  as 
complex  as  the  STS  map  for  an  entire 
chromosome. 

GSDB staff has constructed discontiguous
sequences for human chromosomes 1
through  22  and  X  that  include  markers 
from  Massachusetts  Institute  of  Technol- 
ogy-Whitehead  Institute  STS  maps  and 
from  the  Stanford  Human  Genome  Cen- 
ter. The  set  of  2000  STS  markers  for 
chromosome  X,  which  were  mapped  re- 
cently by  Washington  University  at 
St.  Louis,  also  have  been  added  to  chro- 
mosome X.  About  50  genomic  sequences 
have  been  added  to  the  chromosome  22 
map  by  determining  their  overlap  with 
STS  markers.  Genomic  sequences  are 
being  added  to  all  the  chromosomes  as 
their  overlap  with  the  STS  markers  is 
determined.  These  discontiguous  se- 
quences can  be  retrieved  easily  and 
viewed  via  their  sequence  names  using 
the GSDB Annotator. Sequence names
follow  the  format  of  HUMCHR#MP, 
where  #  equals  1  through  22  or  X. 
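
A discontiguous sequence, as described above, can be pictured as a set of separate pieces with known order and distance. The sketch below is one possible representation, not GSDB's internal one. The field names, function names, and coordinates are invented. Only the HUMCHR#MP naming format comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    name: str     # label of one contiguous piece of sequence
    start: int    # 0-based start within the discontiguous sequence
    length: int   # number of bases in the piece


def chromosome_map_name(chromosome):
    """Build a name in the HUMCHR#MP format, where # is 1 through 22 or X."""
    label = str(chromosome).upper()
    if label not in {str(n) for n in range(1, 23)} | {"X"}:
        raise ValueError(f"no chromosome map for {chromosome!r}")
    return f"HUMCHR{label}MP"


def gaps_between(pieces):
    """List the unsequenced distance between each pair of neighbouring pieces."""
    ordered = sorted(pieces, key=lambda p: p.start)
    return [
        (left.name, right.name, right.start - (left.start + left.length))
        for left, right in zip(ordered, ordered[1:])
    ]


# An STS described only by its two primers (coordinates are invented).
sts = [Piece("forward_primer", 0, 20), Piece("reverse_primer", 180, 20)]
```

Here `gaps_between(sts)` reports 160 bases between the two primers, and `chromosome_map_name(22)` yields `HUMCHR22MP`. The same structure would also hold the exons of a gene or the markers of a whole-chromosome STS map, since only the number of pieces changes.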

GSDB  staff  also  has  utilized  discontigu- 
ous sequences  to  construct  maps  for 
maize and rice. The maize discontiguous
sequences were constructed using markers
from the University of Missouri,
Columbia.  Markers  for  the  rice 
discontiguous  sequence  were  obtained 
from  the  Rice  Genome  Database  at 
Cornell  University  and  the  Rice  Ge- 
nome Research  Project  in  Japan. 

New  Tools 

As  a  result  of  the  major  GSDB  renova- 
tion, new  tools  were  needed  for  submit- 
ting and  accessing  database  data. 
Annotator  was  developed  as  a  graphical 
interface  that  can  be  used  to  view,  up- 
date, and  submit  sequence  data  (http:// 
www.ncgr.org/gsdb/beta.html).  Maestro, 
a  Web-based  interface,  was  developed 
to  assist  researchers  in  data  retrieval 
(http://www.ncgr.org/gsdb/maestrobeta. 
html).  Although  both  these  tools  cur- 
rently are  available  to  researchers, 
GSDB  is  continuing  development  to 
add  increased  capabilities. 

Annotator  displays  a  sequence  and  its 
associated  biological  information  as  an 
image, with the scale of the image adjustable
by the user. Additional information
about the sequence or an associated
biological  feature  can  be  obtained  in  a 
pop-up  window.  Annotator  also  allows  a 
user  to  retrieve  a  sequence  for  review, 
edit existing data, or add annotation to
the  record.  Sequences  can  be  created  us- 
ing Annotator,  and  any  sequences  cre- 
ated or  edited  can  be  saved  either  to  a 
local  file  for  later  review  and  further  ed- 
iting or  saved  directly  to  the  database. 

Correct  database  structures  are  impor- 
tant for  storing  data  and  providing  the 
research  community  with  tools  for 
searching  and  retrieving  data.  GSDB  is 
making  a  concerted  effort  to  expand  and 
improve  these  services.  The  first  gen- 
eration of  the  Maestro  query  tool  is 
available  from  the  GSDB  Web  pages. 
Maestro  allows  researchers  to  perform 
queries  on  18  different  fields,  some  of 
which  are  queryable  only  through 
GSDB,  for  example,  D  segment  num- 
bers from  the  Genome  Database  at 
Johns  Hopkins  University  in  Baltimore. 




Additionally,  Maestro  allows  queries 
with  mixed  Boolean  operators  for  a 
more  refined  search.  For  example,  a 
user  may  wish  to  compare  relatively 
long  mouse  and  human  sequences  that 
do  not  contain  identified  coding  re- 
gions. To  obtain  all  sequences  meeting 
these  criteria,  the  scientific  name  field 
would  be  searched  first  for  "Mus  mus- 
culus"  and  then  for  "Homo  sapiens"  us- 
ing the  Boolean  term  "OR."  Then  the 
sequence-length filter could be used to
refine the search to sequences longer
than 10,000 base pairs. To exclude sequences
containing identified coding-region
features, the "BUT NOT" term can
be  used  with  the  Feature  query  field  set 
equal  to  "coding  region." 
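
The logic of the example query can be written out as a short sketch. This does not use Maestro or any GSDB interface. It only restates the three conditions (species by OR, the length filter, and the BUT NOT exclusion) as a filter over in-memory records. The record layout, function name, and accession values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str          # invented identifier
    scientific_name: str
    length: int             # sequence length in base pairs
    features: frozenset     # feature keys annotated on the sequence


def long_noncoding_mouse_or_human(entries, min_length=10_000):
    selected = []
    for e in entries:
        # "Mus musculus" OR "Homo sapiens"
        species_ok = e.scientific_name in ("Mus musculus", "Homo sapiens")
        # sequence-length filter: longer than 10,000 base pairs
        long_enough = e.length > min_length
        # BUT NOT Feature = "coding region"
        has_coding = "coding region" in e.features
        if species_ok and long_enough and not has_coding:
            selected.append(e.accession)
    return selected


# Invented records for illustration.
entries = [
    Entry("X0001", "Mus musculus", 15_000, frozenset()),
    Entry("X0002", "Homo sapiens", 12_000, frozenset({"coding region"})),
    Entry("X0003", "Homo sapiens", 8_000, frozenset()),
    Entry("X0004", "Zea mays", 20_000, frozenset()),
    Entry("X0005", "Homo sapiens", 10_001, frozenset({"repeat region"})),
]
```

With these records the filter keeps X0001 and X0005. The others fail on the coding-region exclusion, on length, or on species.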

With  Maestro,  users  can  view  the  list  of 
search  matches  a  few  at  a  time  and  re- 
trieve more  of  the  list  as  needed.  From 
the  list,  users  can  select  one  or  several 
sequences  according  to  their  short  de- 
scriptions and  review  or  download  the 
sequence  information  in  GIO,  FASTA, 
or GSDB flatfile format.

Future  Plans 

Although  most  pieces  necessary  for  op- 
eration are  now  in  place,  GSDB  is  still 
improving  functionality  and  adding  en- 
hancements. During  the  next  year 
GSDB,  in  collaboration  with  other  re- 
searchers, anticipates  creating  more 
discontiguous  sequence  maps  for  sev- 
eral model  organisms,  adding  more 
functionality  to  and  providing  a  Web- 
based  submission  tool  and  tool  kit  for 
creating  GIO  files. 


Microbial Genome
Web Pages

NCGR also maintains informational
Web pages on microbial genomes.
These pages, created as a community
reference, contain a list of current or
completed eubacterial, archaeal, and
eukaryotic genome sequencing projects.
Each main page includes the name of
the organism being sequenced, sequencing
groups involved, background information
on the organism, and its current
location on the Carl Woese Tree of Life.
As the Microbial Genome Project
progresses, the pages will be updated as
appropriate.

Genetics and Public
Issues Program

GPI serves as a crucial resource for
people seeking information and making
decisions about genetics or genomics
(http://www.ncgr.org/gpi). GPI develops
and provides information that explains
the ethical, legal, policy, and social relevance
of genetic discoveries and applications.

To achieve its mission, GPI has set forth
three goals: (1) preparation and development
of resources, including careful
delineation of ethical, legal, policy, and
social issues in genetics and genomics;
(2) dissemination of genetic information
targeted to the public, legal and health
professionals, policymakers, and decision
makers; and (3) creation of an information
network to facilitate
interaction among groups.

GPI delivers information through four
primary vehicles: online resources, conferences,
publications, and educational
programs. The GPI program maintains a
continually evolving World Wide Web
site containing a range of material
freely accessible over the Internet.

Program Management


The  Human  Genome  Program 
was  conceived  in  1986  as  an 
initiative within the DOE Office
of Health and Environmental
Research, which has
been  renamed  Office  of  Biological  and 
Environmental  Research  (OBER)  (see 
chart  below).  The  program  is  administered 
primarily  through  the  OBER  Health  Ef- 
fects and  Life  Sciences  Research  Division 
(HELSRD),  both  directed  by  David  A. 
Smith  until  his  retirement  in  January 
1996.  Marvin  Frazier  is  now  Director  of 
HELSRD, and OBER is led by Associate
Director Aristides Patrinos, who also
serves as Human Genome Program
manager. Previous directors and managers
are listed in the table below. OBER
is  within  the  Office  of  Energy  Research, 
directed  by  Martha  Krebs. 


DOE  OBER  Mission 

Based on mandates from Congress,
DOE  OBER's  principal  missions  are  to 
(1)  develop  the  knowledge  necessary  to 
identify,  understand,  and  anticipate 
long-term  health  and  environmental 
consequences  of  energy  use  and  devel- 
opment and  (2)  employ  DOE's  unique 
scientific  and  technological  capabilities 
in  solving  major  scientific  problems  in 
medicine, biology, and the environment.

Genome  integrity  and  radiation  biology 
have  been  a  long-term  concern  of 
OBER at DOE and its predecessors:
the  Atomic  Energy  Commission  (AEC) 
and  the  Energy  Research  and  Develop- 
ment Administration  (ERDA).  In  the 
United States, the first federal support
for genetic research was through AEC. In
the  early  days  of  nuclear  energy  develop- 
ment, the  focus  was  on  radiation  effects 
and  broadened  later  under  ERDA  and 
DOE  to  include  health  implications  of  all 
energy  technologies  and  their  by-products. 

Today,  extensive  OBER-sponsored  re- 
search programs  on  genomic  structure, 
maintenance,  damage,  and  repair  con- 
tinue at  the  national  laboratories  and  uni- 
versities. These  and  other  OBER 
efforts  support  a  DOE  shift  toward  a  pre- 
ventive approach  to  health,  environment, 
and  safety  concerns.  World-class  scien- 
tists in  top  facilities  working  on  leading- 
edge  problems  spawn  the  knowledge  to 
revolutionize  the  technology,  drive  the 
future,  and  add  value  to  the  U.S. 
economy.  Major  OBER  research  includes 
characterization  of  DNA  repair  genes  and 
improvement  of  methodologies  and  re- 
sources for  quantifying  and  characteriz- 
ing genetic  polymorphisms  and  their 
relationship  to  genetic  susceptibilities. 

To  carry  out  its  national  research  and  de- 
velopment obligations,  OBER  conducts 
the  following  activities: 

•  Sponsors  peer-reviewed  research  and 
development  projects  at  universities, 
in  the  private  sector,  and  at  DOE  na- 
tional laboratories  (see  box,  p.  59). 

•  Considers  novel,  beneficial  initiatives 
with  input  from  the  scientific  commu- 
nity and  governmental  sectors. 

•  Provides  expertise  to  various  govern- 
mental working  groups. 

•  Supports  the  capabilities  of  multi- 
disciplinary  DOE  national  laborato- 
ries and  their  unique  user  facilities 
for  the  nation's  benefit  (p.  61). 

Human  Genome  Program  resources  and 
technologies  are  focused  on  sequencing 
the  human  genome  and  related  infor- 
matics and  supportive  infrastructure  (see 
chart  and  tables,  p.  62).  The  genomes  of 
selected  microorganisms  are  analyzed 
under  the  separate  Microbial  Genome 
Program.

Major DOE User Facilities and Resources
Relevant  to  Molecular  Biology  Research 


Although the genome program is contributing fundamental information about the structure of chromosomes
and genes, other types of knowledge are required to understand how genes and their products function.
Three-dimensional protein structure studies are still essential because structure cannot be predicted fully from its
encoded DNA sequence.

To enhance these and other studies, DOE builds and maintains structural biology user facilities that enable
scientists to gain an understanding of relationships between biological structures and their functions, study
disease processes, develop new pharmaceuticals, and conduct basic research in molecular biology and
environmental processes. These resources are used heavily by both academic and private-sector scientists.

Other important resources available to the research community include the clone libraries developed in the
National Laboratory Gene Library Project and distributed worldwide, the GRAIL Online Sequence
Interpretation Service, and the Mouse Genetics Research Facility.


Argonne  National  Laboratory 
Advanced  Photon  Source 

Brookhaven  National  Laboratory 
High-Flux  Beam  Reactor 
National  Synchrotron  Light  Source 
Protein  Structure  Data  Bank 
Scanning Transmission Electron Microscope

Lawrence Berkeley National Laboratory
Advanced Light Source
Center for X-Ray Optics
National  Energy  Research  Scientific  Computing  Center 

Lawrence  Livermore  National  Laboratory 
National  Laboratory  Gene  Library  Project 


Los  Alamos  National  Laboratory 

National  Flow-Cytometry  Resource 
National  Laboratory  Gene  Library  Project 
Neutron-Scattering  Center 

Oak  Ridge  National  Laboratory 

GRAIL,  Online  Sequence  Interpretation  Service 
Mouse  Genetics  Research  Facility 

Pacific  Northwest  National  Laboratory 

Environmental  Molecular  Sciences  Laboratory 

Stanford  University 

Synchrotron Radiation Laboratory

Operating Expenditures and FY 1998 Projected Budget
for the DOE Human Genome Program


Coordination  and  Resources 

Program  coordination  is  the  responsibility  of  the  Human  Genome  Task  Group  (see 
box,  p.  60),  which,  beginning  in  1997,  includes  Elbert  Branscomb,  the  Joint  Genome 
Institute's  Scientific  Director.  The  task  group  is  aided  by  the  Biotechnology  Consor- 
tium (which  succeeded  the  former  Human  Genome  Coordination  Committee;  see 
box, p. 60) to foster information exchange and dissemination. The task group administers
the DOE Human Genome Program and its evolving needs and reports to the

Associate  Director  for  Biological  and 
Environmental  Research  (currently 
Aristides  Patrinos).  The  task  group  ar- 
ranges periodic  workshops  and  coor- 
dinates site  reviews  for  genome 
centers,  the  Joint  Genome  Institute, 
databases,  and  other  large  projects.  It 
also  coordinates  peer  review  of  research 
proposals,  administration  of  awards,  and 
collaboration  with  all  concerned  agen- 
cies and  organizations. 

The Biotechnology Consortium provides
the OBER Associate Director with
external expertise in all aspects of genomics
and informatics and a mechanism
by which OBER can keep track of
the  latest  developments  in  the  field.  It 
facilitates  development  and  dissemination 
of  novel  genome  technologies  through- 
out the  DOE  system,  ensures  appropri- 
ate management  and  sharing  of  data  and 
resources  by  all  DOE  contractors  and 
grantees,  and  promotes  interactions  with 
other national and international genomic
entities.

Human Genome Program Fiscal Year Expenditures ($M)

Year     Operating   Capital Equipment   Construction   Total
1996     68.3        5.6                 5.7            79.6
1997     73.9        6.0                 1.0            80.9
1998*    79.9        5.2                 0.0            85.1

*Projected expenses.

Human Genome Program Operating Funds Distribution in FY 1996 ($K)

FY 1996            Mapping   Sequencing   Sequencing   Informatics   ELSI          Administration   Totals        %
                                          Technology
DOE Laboratories   8,980     10,995       11,128       6,840         [illegible]   2,783*           [illegible]   60.1
Academic           6,671     4,368        3,257        6,178         642           4                21,120        30.9
Nonprofit          563       0            467          2,783         1,311         38               5,162         7.5
Federal            0         0            0            0             0             1,000**          1,000         1.5
Total              16,214    15,363       14,852       15,801        [illegible]   3,825            [illegible]
% of Total         23.8      22.5         21.7         23.1          [illegible]   5.4              100

*Includes DOE laboratories' nonresearch costs but not U.S. government administration or SBIR.
**DOE contribution to the International Human Frontiers Neurosciences Program.

Communication

The DOE Human Genome Program
communicates  information  in  a  variety 
of  ways.  These  communication  systems 
include  the  Human  Genome  Manage- 
ment Information  System  (HGMIS), 
projects  in  the  Ethical,  Legal,  and  Social 
Issues  (ELSI)  Program,  electronic  re- 
sources, meetings,  and  fellowships. 
Some of these mechanisms are described
below. For more details, see Research
Highlights, ELSI projects, p. 18.

HGMIS 

HGMIS  provides  technical  communica- 
tion and  information  services  for  the 
DOE  OBER  Human  Genome  Program 
Task  Group.  HGMIS  is  charged  with 

(1)  helping  to  communicate  genome- 
related  matters  and  research  to  contrac- 
tors, grantees,  other  (nongenome  project) 
researchers,  and  other  multipliers  of  in- 
formation pertaining  to  genetic  research; 

(2)  serving  as  a  clearinghouse  for  inquir- 
ies about  the  U.S.  genome  project;  and 

(3)  reducing  research  duplication  by  pro- 
viding a  forum  for  interdisciplinary  in- 
formation exchange  (including  resources 
developed)  among  genetic  investigators 
worldwide. 

HGMIS  publishes  the  newsletter  Human 
Genome  News,  sponsored  by  OBER. 
Over 14,000 HGN subscribers include
genome  and  basic  researchers  at  national 
laboratories,  universities,  and  other  re- 
search institutions;  professors  and  teach- 
ers; industry  representatives;  legal 
personnel;  ethicists;  students;  genetic 
counselors;  physicians;  science  writers; 
and  other  interested  individuals. 

HGMIS  also  produces  the  DOE  Primer 
on  Molecular  Genetics;  a  compilation  of 
ELSI  abstracts;  and  reports  on  the  DOE 
Human  Genome  and  Microbial  Genome 
Programs,  contractor-grantee  work- 
shops, and  other  related  subjects. 

Electronic  versions  of  the  primer  and 
other  HGMIS  publications  are  available 
via the World Wide Web. HGMIS also


initiates  and  maintains  other  related 
Web  sites  (see  DOE  Electronic  Genome 
Resources  section  below  and  DOE  Web 
Sites  at  right). 

In  addition  to  their  print  and  online  pub- 
lishing efforts,  HGMIS  staff  members 
answer  questions  generated  via  Web 
sites,  telephone,  fax,  and  e-mail.  They 
also  furnish  customized  information 
about  the  genome  project  for  multipliers 
of  information  (contact:  Betty  Mansfield 
at  423/576-6669,  Fax:  /574-9888, 
mansfieldbk@ornl.gov).

DOE  Electronic  Genome 
Resources 

Web  Sites.  The  DOE  Human  Genome 
Program  Home  Page  displays  pointers 
to  other  programs  within  OBER  and  the 
Office  of  Energy  Research.  Links  are 
made  to  additional  biological  and  envi- 
ronmental information  and  to  HGMIS, 
Genome  Database,  and  other  sites. 

HGMIS  initiates  and  maintains  the 
searchable  Human  Genome  Project  In- 
formation Web  site.  This  site  contains 
more than 1700 text files of information
for  multidisciplinary  technical  audiences 
as  well  as  for  lay  persons  interested  in 
learning  about  the  science,  goals, 
progress,  and  history  of  the  project.  Us- 
ers include  almost  all  levels  of  students; 
education,  medical,  and  legal  profes- 
sionals; genetic  society  and  support 
group  members;  biotechnology  and 
pharmaceutical  industry  personnel;  ad- 
ministrators; policymakers;  and  the  press. 

The  site  also  houses  a  section  of  fre- 
quently asked  questions,  a  quick  fact 
finder, Primer on Molecular Genetics,
all issues of Human Genome News,
DOE Human Genome Program and
contractor-grantee workshop reports,
To Know Ourselves, historical documents,
research abstracts, calendars of
genome events, and hundreds of links to
genome research and educational sites.
More than 1000 other Web pages link to
this  site,  resulting  in  more  than  100,000 
text file transfers each month. This
HGMIS site has received a Four-Star
designation from the Magellan Group
and the Editor's Choice Award from
LookSmart.

DOE Web Sites

DOE Human Genome Program
http://www.er.doe.gov/production/ober/hug_top.html
Genome-project  and  related  meetings 
are  listed  at  a  Web  site  (see  box,  p.  63), 
through  which  users  can  register  and 
submit  research  abstracts.  Another  listed 
related  site  discusses  issues  at  the  criti- 
cal intersection  of  genetics  and  the  court 
system.  This  Web  page  is  part  of  a 
project  to  educate  and  prepare  the  judi- 
ciary for  the  coming  onslaught  of  cases 
involving  genetic  issues  and  data. 

Newsgroup.  The  Human  Genome  Pro- 
gram Newsgroup  operates  through  the 
BIOSCI  electronic  bulletin  board  net- 
work to  allow  researchers  worldwide  to 
communicate, share ideas, and find solutions
to problems. Genome-related information
is distributed through the
newsgroup,  including  requests  for  grant 
applications,  reports  from  recent  scien- 
tific and  advisory  meetings,  announce- 
ments of  future  events,  and  listings  of 
free software and services
(gnome-pr@net.bio.net or http://www.bio.net).

Postdoctoral  Fellowships 

OBER  established  the  Human  Genome 
Distinguished  Postdoctoral  Research 
Program  in  1990  to  support  research  on 
projects  related  to  the  DOE  Human  Ge- 
nome Program.  Beginning  in  FY  1996, 
the  Human  Genome  Distinguished 
Postdoctoral  Fellowships  were  merged 
with  the  Alexander  Hollaender  Distin- 
guished Postdoctoral  Fellowships, 
which  provide  support  in  all  areas  of 
OBER-sponsored  research.  Postdoctoral 
programs  are  administered  by  the  Oak 
Ridge  Institute  for  Science  and  Educa- 
tion, a  university  consortium  and  DOE 
contractor. For additional information,
contact Linda Holmes (423/576-3192,
holmesl@orau.gov) or see the Web site
(http://www.orau.gov/ober/hollaend.htm).


Human  Genome  Distinguished 
Postdoctoral  Fellows 


Names of past and current fellows in genome topics are given below
with their research institutions and titles of proposed research. For 1996
research abstracts, refer to Index of Principal and Coinvestigators on
p. 71 in Part 2 of this report.

1994    Mark Graves (Baylor College of Medicine): Graph Data
Models for Genome Mapping

William Hawe (Duke University): Synthesis of Peptide Nucleic
Acids for DNA Sequencing by Hybridization

Jingyue Ju (University of California, Berkeley): Design,
Synthesis, and Use of Oligonucleotide Primers Labeled with
Energy Transfer-Coupled Dyes

Mark Shannon (Oak Ridge National Laboratory): Comparative
Study of a Conserved Zinc Finger Gene Region

1995    Evan Eichler (Lawrence Livermore National Laboratory):
Identification, Organization, and Characterization of Zinc
Finger Genes in a 2-Mb Cluster on 19p12

Kelly Ann Fraser (Lawrence Berkeley National Laboratory):
In Vivo Complementation of the Murine Mutations Grizzled,
Mocha, and Jittery

Soo-In Hwang (Lawrence Berkeley National Laboratory):
Positional Cloning of Oncogenes on 20q13.2

James Labrenz (University of Washington, Seattle): Error
Analysis of Principal Sequencing Data and Its Role in Process
Optimization for Genome-Scale Sequencing Projects

Marie Ruiz-Martinez (Northeastern University): Multiplex
Purification Schemes for DNA Sequencing-Reaction Products:
Application to Gel-Filled Capillary Electrophoresis

Todd Smith (University of Washington, Seattle): Managing the
Flow of Large-Scale DNA Sequence Information


Alexander Hollaender Distinguished
Postdoctoral Fellows in Genome Research

1996    Cymbeline Culiat (Oak Ridge National Laboratory): Cloning
of a Mouse Gene Causing Severe Deafness and Balance
Defects

Tau-Mu Yi (Laboratory of Structural Biology and Molecular
Medicine, Los Angeles): Structure-Function Analysis of
Alpha-Factor Receptor

1997    Jeffrey Koshi (Los Alamos National Laboratory): Construction,
Analysis, and Use of Optimal DNA Mutation Matrices

Sandra McCutchen-Maloney (Lawrence Livermore National
Laboratory): Structure and Function of a Damage-Specific
Endonuclease Complex

Coordination with Other Genome Programs


The U.S. Human Genome
Project  is  supported  jointly 
by  the  Department  of  En- 
ergy (DOE)  and  the  Na- 
tional Institutes  of  Health 
(NIH),  each  of  which  emphasizes  dif- 
ferent facets.  The  two  agencies  coordi- 
nate their  efforts  through  development 
of common project goals and joint support
of some programs addressing ethical,
legal, and social issues (ELSI)
arising  from  new  genome  tools,  tech- 
nology, and  data. 

Extraordinary  advances  in  genome  re- 
search are  due  to  contributions  by  many 
investigators  in  this  country  and  abroad. 
In  the  United  States,  such  research  (in- 
cluding nonhuman)  also  is  funded  by 
other  federal  agencies  and  private  foun- 
dations and  groups.  Many  countries  are 
major  contributors  to  the  project  through 
international  collaborations  and  their  own 
focused  programs.  Coordinating  and 
facilitating  these  diverse  research  ef- 
forts around  the  world  is  the  aim  of 
the  nongovernmental  international 
Human  Genome  Organisation. 

Some  details  of  U.S.  and  worldwide 
coordination  are  provided  below. 

U.S.  Human  Genome 
Project:  DOE  and  NIH 

In 1988 DOE and NIH developed a Memorandum of Understanding
that formalized the coordination of their efforts to decipher
the human genome and thus "enhance the human genome research
capabilities of both agencies." In early 1990 they presented
Congress with a joint plan, Understanding Our Genetic
Inheritance. The U.S. Human Genome Project: The First Five
Years (1991-1995). Referred to as the Five-Year Plan, it
contained short-term scientific goals for the coordinated,
multiyear research project and a comprehensive spending plan.
Unexpectedly rapid progress in mapping prompted early revision
of the original 5-year goals in the fall of 1993 [Science 262,
43-46 (October 1, 1993)]. Current goals, which run through
September 30, 1998, are listed on page 5; text of both 5-year
plans is accessible via the Web
(http://www.ornl.gov/hgmis/project/hgp.html).

DOE  and  NIH  have  adopted  a  joint 
policy  to  promote  sharing  of  genome 
data  and  resources  for  facilitating 
progress  and  reducing  duplicated  work. 
(See  Appendix  B:  DOE-NIH  Sharing 
Guidelines,  p.  75.) 

ELSI  Considerations 

NIH and DOE devote at least 3% of their respective genome
program budgets to identifying, analyzing, and addressing the
ELSI considerations surrounding genome technology and the data
it produces. The DOE ELSI component focuses on research into
the privacy and confidentiality of personal genetic
information, genetics relevant to the workplace,
commercialization (including patenting) of genome research
data, and genetic education for the general public and
targeted communities. The NIH ELSI component supports studies
on a range of ethical issues surrounding the conduct of
genetic research and responsible clinical integration of new
genetic technologies, especially in testing for mutations
associated with cystic fibrosis and heritable breast, ovarian,
and colon cancers.

In 1990, the DOE-NIH Joint ELSI Working Group was established
to identify, address, and develop policy options; stimulate
bioethics research; promote education of professional and lay
groups; and collaborate with such international groups as the
Human Genome Organisation (HUGO); United Nations Educational,
Scientific, and Cultural Organization; and the European
Community. Research funded by the U.S. Human Genome Project
through the joint working group has produced policy
recommendations in various areas. In May 1993, for example,
the DOE-NIH Joint ELSI Working Group Task Force on Genetic
Information and Insurance issued a report with recommendations
for managing the impact of advances in human genetics on the
current system of healthcare coverage. In 1996, the working
group released guidelines for investigators on using DNA from
human subjects for large-scale sequencing projects. The
guidance emphasizes numerous ways to preserve donor anonymity
[see Appendix C, p. 77, and the World Wide Web
(http://www.ornl.gov/hgmis/archive/nchgrdoe.html)].

In 1997, following an evaluation, the two agencies modified
the ELSI working group into the ELSI Research and Program
Evaluation Group (ERPEG). ERPEG will focus more specifically
on research activities supported by DOE and NIH ELSI programs.

Other  U.S.  Programs 

The  potential  impact  of  genome  re- 
search on  society  and  the  rapid  growth 
of  the  biotechnology  industry  have 
spurred  the  initiation  of  other  genome 
research  projects  in  this  country  and 
worldwide.  These  projects  aim  to  create 
maps  of  the  human  genome  and  the  ge- 
nomes of  model  organisms  and  several 
economically  important  microbes, 
plants,  and  animals. 

•  The  DOE  Microbial  Genome  Pro- 
gram, begun  in  1994,  is  producing 
complete  genome  sequence  data  on 
industrially  important  microorgan- 
isms, including  those  that  live  under 
extreme  environmental  conditions. 
The  sequences  of  several  microbial 
genomes  have  been  completed. 
[http://www.er.doe.gov/production/ober/EPR/mig_top.html]

•  In 1990, the National Science Foundation, DOE, and the U.S.
Department of Agriculture (USDA) initiated a project to map
and sequence the genome of the model plant Arabidopsis
thaliana. The goal of this project is to enhance fundamental
understanding of plant processes. In 1996, the three agencies
began funding systematic, large-scale genomic sequencing of
the 120-megabase Arabidopsis genome, with the goal of
completing it by 2004. DOE support is provided through the
Office of Basic Energy Sciences.
[http://pgec-genome.pw.usda.gov/agi.html]

•  USDA  also  funds  animal  genome 
research  projects  designed  to  obtain 
genome maps for economically important species (e.g., corn,
soybeans,
poultry,  cattle,  swine,  and  sheep)  to 
enable  genetic  modifications  that  will 
increase  resistance  to  diseases  and 
pests,  improve  nutrient  value,  and 
increase  productivity. 

•  The  Advanced  Technology  Program 
(ATP)  of  the  U.S.  National  Institute 
of  Standards  and  Technology  pro- 
motes industry-government  partner- 
ships in  DNA  sequencing  and 
biotechnology  through  the  Tools  for 
DNA  Diagnostics  component.  DOE 
staff  participates  in  the  ATP  review 
process (see box, p. 22). [http://www.atp.nist.gov]

•  In  1997  the  NIH  National  Cancer  In- 
stitute established  the  Cancer  Ge- 
nome Anatomy  Project  (CGAP)  to 
develop  new  diagnostic  tools  for  un- 
derstanding molecular  changes  that 
underlie all cancers (http://www.ncbi.nlm.nih.gov/ncicgap). DOE
researchers  are  generating  clone 
libraries  to  support  this  effort. 

International 
Collaborations 

The  current  DOE-NIH  Five-Year  Plan 
commends  the  "spirit  of  international 
cooperation  and  sharing"  that  has  char- 
acterized the  Human  Genome  Project 
and  played  a  major  role  in  its  success. 
Cooperation  includes  collaborations 
among  laboratories  in  the  United  States 
and abroad as well as extensive sharing
of  materials  and  information  among 
genome  researchers  around  the  world. 
The  DOE  Human  Genome  Program 
supports  many  international  collabo- 
rations as  well  as  grantees  in  several 
foreign  institutions. 

Collaborations  involving  the  DOE  hu- 
man genome  centers  include  mapping 
chromosomes  16  and  19,  developing  re- 
sources, and  constructing  the  human 
gene  map  from  shared  cDNA  libraries. 
These  libraries  were  generated  by  the 
Integrated  Molecular  Analysis  of  Gene 
Expression  (called  IMAGE)  Consor- 
tium initiated  by  groups  at  Lawrence 
Livermore  National  Laboratory,  Colum- 
bia University,  NIH  National  Institute 
of Mental Health, and Généthon
(France). 

Investigators  from  almost  every  major 
sequencing  center  in  the  world  met  in 
Bermuda in February 1996 and again in
1997  to  discuss  issues  related  to  large- 
scale  sequencing.  These  meetings  were 
designed  to  help  researchers  coordinate, 
compare,  and  evaluate  human  genome 
mapping  and  sequencing  strategies; 
consider  new  sequencing  and  infor- 
matics technologies;  and  discuss  re- 
lease of  data. 

Human  Genome 
Organisation 

Founded  by  scientists  in  1989,  HUGO 
is  a  nongovernmental  international 
organization  providing  coordination 
functions  for  worldwide  genome  efforts. 
HUGO  activities  range  from  support  of 
data  collation  for  constructing  genome 


maps  to  organizing  workshops.  HUGO 
also  fosters  exchange  of  data  and 
biomaterials,  encourages  technology 
sharing,  and  serves  as  a  coordinating 
agency  for  building  relationships  among 
various government funding agencies
and  the  genome  community. 

HUGO  offers  short-term  (2-  to  10-week) 
travel  awards  up  to  $1500  for  investiga- 
tors under  age  40  to  visit  another  coun- 
try to  learn  new  methods  or  techniques 
and  to  facilitate  collaborative  research 
between  the  laboratories. 

HUGO  has  worked  closely  with  interna- 
tional funding  agencies  to  sponsor 
single-chromosome  workshops  (SCWs) 
and  other  genome  meetings.  Due  to  the 
success  of  these  workshops  as  well  as 
the  shift  in  emphasis  from  mapping  to 
sequencing,  DOE  and  NIH  began  to 
phase  out  their  funding  for  international 
SCWs  in  FY  1996  but  encouraged  appli- 
cations for  individual  SCWs  as  needed. 
In  1996,  HUGO  partially  funded  an  in- 
ternational strategy  meeting  in  Bermuda 
on  large-scale  sequencing.  Principles  re- 
garding data  release  and  a  resources  list 
developed  at  the  meeting  are  available 
on the HUGO Web site (http://hugo.gdb.org/hugo.html).

Membership  in  HUGO  (over  1000 
people  in  more  than  50  countries)  is 
extended  to  persons  concerned  with 
human  genome  research  and  related 
scientific  subjects.  Its  current  president 
is  Grant  R.  Sutherland  (Adelaide  Women 
and  Children's  Hospital,  Australia). 
Directed  by  an  18-member  interna- 
tional council,  HUGO  is  supported  by 
grants  from  the  Howard  Hughes  Medi- 
cal Institute  and  The  Wellcome  Trust. 


Countries  with 
Genome  Programs 


Countries with genome programs or strong programs in human
genetics include Australia, Brazil, Canada, China, Denmark,
European Union, France, Germany, Israel, Italy, Japan, Korea,
Mexico, Netherlands, Russia, Sweden, United Kingdom, and
United States.
Appendix A
DOE Human Genome Program: Early History, Enabling Legislation


A  brief  history  of  the  U.S.  Department  of  Energy  (DOE)  Hu- 
man Genome  Program  will  be  useful  in  a  discussion  of  the 
objectives of the DOE program as well as those of the
collaborative U.S. Human Genome Project. The Office of
Biological and Environmental Research (OBER) of DOE
and  its  predecessor  agencies — the  Atomic  Energy  Commis- 
sion and  the  Energy  Research  and  Development  Administra- 
tion— have  long  sponsored  research  into  genetics,  both  in 
microbial  systems  and  in  mammals,  including  basic  studies 
on  genome  structure,  replication,  damage,  and  repair  and  the 
consequences  of  genetic  mutations.  (See  Appendix  E  for 
a  discussion  of  the  DOE  Biological  and  Environmental 
Research  Program.) 

In  1984,  OBER  [then  named  Office  of  Health  and  Environ- 
mental Research  (OHER)]  and  the  International  Commission 
on  Protection  Against  Environmental  Mutagens  and  Carcino- 
gens cosponsored  a  conference  in  Alta,  Utah,  which  high- 
lighted the  growing  roles  of  recombinant  DNA  technologies. 
Substantial portions of the meeting's proceedings were
incorporated into the Congressional Office of Technology
Assessment report, Technologies for Detecting Heritable
Mutations in Humans, in which the value of a reference
sequence of the human genome was recognized.

Acquisition  of  such  a  reference  sequence  was,  however,  far 
beyond  the  capabilities  of  biomedical  research  resources 
and  infrastructure  existing  at  that  time.  Although  the 


small  genomes  of  several  microbes  had  been  mapped  or  par- 
tially sequenced,  the  detailed  mapping  and  eventual  sequenc- 
ing of  24  distinct  human  chromosomes  (22  autosomes  and 
the  sex  chromosomes  X  and  Y)  that  together  comprise  an 
estimated 3 billion subunits was a task some thousandfold
larger.

DOE OHER was already engaged in several multidisciplinary
projects  contributing  to  the  nation's  biomedical  capabilities, 
including  the  GenBank  DNA  sequence  repository,  which 
was  initiated  and  sustained  by  DOE  computer  and  data- 
management  expertise.  Several  major  user  facilities  support- 
ing microstructure  research  were  developed  and  are  main- 
tained by  DOE.  Unique  chromosome-processing  resources 
and  capabilities  were  in  place  at  Los  Alamos  National  Labo- 
ratory and  Lawrence  Livermore  National  Laboratory.  Among 
these  were  the  fluorescence-activated  cell  sorter  (called 
FACS)  systems  to  purify  human  chromosomes  within  the 
National  Laboratory  Gene  Library  Project  for  the  production 
of  libraries  of  DNA  clones.  The  availability  of  these  mono- 
chromosomal  libraries  opened  an  important  path — a  practical 
means  of  subdividing  the  huge  total  genome  into  24  much 
more  manageable  components. 

With  these  capabilities,  OHER  began  in  1986  to  consider  the 
feasibility  of  a  dedicated  human  genome  program.  Leading 
scientists  were  invited  to  the  March  1986  international  con- 
ference at  Santa  Fe,  New  Mexico,  to  assess  the  desirability 


Enabling  Legislation 

In the United States, the first federal support for genetics
research was through the Atomic Energy Commission. In the
early days of nuclear energy development, the focus was on
radiation effects and later broadened under the Energy
Research and Development Administration (ERDA) and the
Department of Energy to include the health implications of all
energy technologies and their by-products. Major enabling
legislation follows.

Atomic  Energy  Act  of  1946 

(P.L. 79-585): Provided the initial charter for a
comprehensive program of research and development related to
the utilization of fissionable and radioactive materials for
medical, biological, and health purposes.

Atomic  Energy  Act  of  1954 

(P.L. 83-703): Further authorized AEC "to conduct research on
the biologic effects of ionizing radiation."

Energy  Reorganization  Act  of  1974 

(P.L. 93-438): Provided that responsibilities of ERDA should
include "engaging in and supporting environmental, biomedical,
physical, and safety research related to the development of
energy resources and utilization technologies."

Federal Non-Nuclear Energy
Research and Development Act of
1974 (P.L. 93-577): Authorized ERDA to conduct a comprehensive
non-nuclear energy research, development, and demonstration
program to include the environmental and social consequences
of the various technologies.

DOE  Organization  Act  of  1977 

(P.L. 95-91): Instructed the department "to assure
incorporation of national environmental protection goals in
the formulation and implementation of energy programs; and to
advance the goal of restoring, protecting, and enhancing
environmental quality, and assuring public health and safety,"
and to conduct "a comprehensive program of research and
development on the environmental effects of energy technology
and programs."
and feasibility of implementing such a project. With virtual
unanimity, participants agreed that ordering and eventually
sequencing DNA clones representing the human genome
were desirable and feasible goals. With the receipt of this
enthusiastic response, OHER initiated several pilot projects.
Program  guidance  was  further  sought  from  the  DOE  Health 
Effects  Research  Advisory  Committee  (HERAC). 


HERAC  Recommendation 

The  April  1987  HERAC  report  recommended  that  DOE  and 
the nation commit to a large, multidisciplinary scientific and
technological  undertaking  to  map  and  sequence  the  human 
genome.  DOE  was  particularly  well  suited  to  focus  on  re- 
source and  technology  development,  the  report  noted; 
HERAC  further  recommended  a  leadership  role  for  DOE 
because  of  its  demonstrated  expertise  in  managing  complex 
and long-term multidisciplinary projects involving both the
development  of  new  technologies  and  the  coordination  of 
efforts  in  industries,  universities,  and  its  own  laboratories. 


Evolution  of  the  nation's  Human  Genome  Project  further  ben- 
efited from  a  1988  study  by  the  National  Research  Council 
(NRC)  entitled  Mapping  and  Sequencing  the  Human  Ge- 
nome, which  recommended  that  the  United  States  support  this 
research  effort  and  presented  an  outline  for  a  multiphase  plan. 


DOE  and  NIH  Coordination 

The  National  Institutes  of  Health  (NIH)  was  a  necessary  par- 
ticipant in  the  large-scale  effort  to  map  and  sequence  the  hu- 
man genome  because  of  its  long  history  of  support  for  bio- 
medical research  and  its  vast  community  of  scientists.  This 
was  confirmed  by  the  NRC  report,  which  recommended  a 
major  role  for  NIH.  In  1987,  under  the  leadership  of  Director 
James  Wyngaarden,  NIH  established  the  Office  of  Genome 
Research in the Director's Office. In 1988, DOE and NIH
signed  a  Memorandum  of  Understanding  in  which  the  agen- 
cies agreed  to  work  together,  coordinate  technical  research 
and  activities,  and  share  results.  In  1990,  DOE  and  NIH  sub- 
mitted a  joint  research  plan  outlining  short-  and  long-term 
goals of the project.
Appendix B
DOE-NIH Guidelines for Sharing Data and Resources


At its December 7, 1992, meeting, the DOE-NIH Joint
Subcommittee on the Human Genome approved the following
sharing guidelines, developed from the DOE draft of September
1991.*

The  information  and  resources  generated  by  the  Human  Ge- 
nome Project  have  become  substantial,  and  the  interest  in 
having  access  to  them  is  widespread.  It  is  therefore  desirable 
to  have  a  statement  of  philosophy  concerning  the  sharing  of 
these  resources  that  can  guide  investigators  who  generate  the 
resources  as  well  as  those  who  wish  to  use  them. 

A  key  issue  for  the  Human  Genome  Project  is  how  to  pro- 
mote and  encourage  the  rapid  sharing  of  materials  and  data 
that  are  produced,  especially  information  that  has  not  yet 
been  published  or  may  never  be  published  in  its  entirety. 
Such  sharing  is  essential  for  progress  toward  the  goals  of  the 
program  and  to  avoid  unnecessary  duplication.  It  is  also  de- 
sirable to  make  the  fruits  of  genome  research  available  to  the 
scientific  community  as  a  whole  as  soon  as  possible  to  expe- 
dite research  in  other  areas. 

Although  it  is  the  policy  of  the  Human  Genome  Project  to 
maximize  outreach  to  the  scientific  community,  it  is  also  nec- 
essary to  give  investigators  time  to  verify  the  accuracy  of 
their  data  and  to  gain  some  scientific  advantage  from  the  ef- 
fort they  have  invested.  Furthermore,  in  order  to  assure  that 
novel  ideas  and  inventions  are  rapidly  developed  to  the  ben- 
efit of  the  public,  intellectual  property  protection  may  be 
needed  for  some  of  the  data  and  materials. 


After  extensive  discussion  with  the  community  of  genome 
researchers, the advisors of the NIH and DOE genome programs
have determined that consensus is developing around
the  concept  that  a  6-month  period  from  the  time  the  data  or 
materials  are  generated  to  the  time  they  are  made  available 
publicly  is  a  reasonable  maximum  in  almost  all  cases.  More 
rapid  sharing  is  encouraged. 

Whenever  possible,  data  should  be  deposited  in  public  data- 
bases and  materials  in  public  repositories.  Where  appropriate 
repositories  do  not  exist  or  are  unable  to  accept  the  data  or 
materials,  investigators  should  accommodate  requests  to  the 
extent  possible. 

The  NIH  and  DOE  genome  programs  have  decided  to  re- 
quire all  applicants  expecting  to  generate  significant  amounts 
of  genome  data  or  materials  to  describe  in  their  application 
how  and  when  they  plan  to  make  such  data  and  materials 
available  to  the  community.  Grant  solicitations  will  specify 
this  requirement.  These  plans  in  each  application  will  be  re- 
viewed in  the  course  of  peer  review  and  by  staff  to  assure 
they  are  reasonable  and  in  conformity  with  program  philoso- 
phy. If  a  grant  is  made,  the  applicant's  sharing  plans  will  be- 
come a  condition  of  the  award  and  compliance  will  be  re- 
viewed before  continuation  funding  is  provided.  Progress 
reports  will  be  asked  to  address  the  issue. 


*Reprinted from Human Genome News 4(5), 4 (1993).
Appendix C
NIH-DOE Guidance on Human Subjects Issues
in Large-Scale DNA Sequencing

Date issued: August 9, 1996

Introduction

The Human Genome Project (HGP) is now entering into
large-scale DNA sequencing. To meet its complete sequencing
goal, it will be necessary to recruit volunteers willing to
contribute their DNA for this purpose. The guidance provided
in this document is intended to address ethical issues that
must be considered in designing strategies for recruitment and
protection of DNA donors for large-scale sequencing.

Nothing in this document should be construed to differ from,
or substitute for, the policies described in the Federal
Regulations for the Protection of Human Subjects [45CFR46
(NIH) and 10CFR745 (DOE)]. Rather, it is intended to
supplement those policies by focusing on the particular issues
raised by large-scale human DNA sequencing. This statement
addresses six topics: (1) benefits and risks of genomic DNA
sequencing; (2) privacy and confidentiality; (3) recruitment
of DNA donors as sources for library construction; (4)
informed consent; (5) IRB approval; and (6) use of existing
libraries.

The guidance provided in this statement is intended to afford
maximum protection to DNA donors and is based on the belief
that protection can best be achieved by a combination of
approaches including:

•  ensuring that the initial version of the complete human
DNA sequence is derived from multiple donors;

•  providing donors with the opportunity to make an informed
decision about whether to contribute their DNA to this
project; and

•  taking effective steps to ensure the privacy and
confidentiality of donors.

1.  Benefits and Risks of Genomic DNA
Sequencing

The HGP offers great promise for the improvement of human
health. As a consequence of the HGP, there will be a more
thorough understanding of the genetic bases of human biology
and of many diseases. This, in turn, will lead to better
therapies and, perhaps more importantly, prevention strategies
for many of those diseases. Similarly, as the technology
developed by the HGP is applied to understanding the biology
of other organisms, many other human activities will be
affected including agriculture, environmental management,
and biologically based industrial processes.

While  the  HGP  offers  great  promise  to  humanity,  there  will 
be  no  direct  benefit,  in  either  clinical  or  financial  terms,  to 
any  of  the  individuals  who  choose  to  donate  DNA  for 
large-scale  sequencing.  Rather,  the  motivation  for  donation  is 
likely to be an altruistic willingness to contribute to this
historic research effort.


However,  individuals  who  donate  DNA  to  this  effort  may 
face  certain  risks.  Information  derived  from  the  donors  will 
become  available  in  public  databases.  Such  information  may 
reveal, for example, DNA sequence-based information about
disease  susceptibility.  If  the  donor  becomes  aware  of  such 
information,  it  could  lead  to  emotional  distress  on  her/his 
part.  If  such  health-related  information  becomes  known  to 
others,  discrimination  against  the  donor  (e.g.,  in  insurance  or 
in  employment)  could  result.  Unwanted  notoriety  is  another 
potential  risk  to  donors.  Therefore,  those  engaged  in 
large-scale  sequencing  must  be  sensitive  to  the  unique  fea- 
tures of  this  type  of  research  and  ensure  that  both  the  protec- 
tions normally  afforded  research  subjects  and  the  special  is- 
sues associated  with  human  genomic  DNA  sequencing  are 
thoroughly  addressed. 

While some risks to donors can already be identified, the
probability  of  adverse  events  materializing  appears  to  be 
low.  However,  the  risks  of  harm  to  individuals  will  increase 
if  confidentiality  is  not  maintained  and/or  the  number  of  do- 
nors is  limited  to  a  very  few  individuals.  Either,  or  both,  of 
these  situations  would  increase  the  possibility  of  a  donor's 
identity  being  revealed  without  his/her  knowledge  or 
permission. 

A  final  issue  to  consider  is  characterized  in  a  statement  taken 
from the OPRR Guidebook,¹ which points out that "some areas
[of genetic research] present issues for which no clear
guidance  can  be  given  at  this  point,  either  because  enough  is 
not  known  about  the  risks  presented  by  the  research,  or  be- 
cause no  consensus  on  the  appropriate  resolution  of  the  prob- 
lem exists."  It  is  anticipated  that  the  DNA  sequence  informa- 
tion produced  by  the  Human  Genome  Project  will  be  used  in 
the  future  for  types  of  research  which  cannot  now  be  pre- 
dicted and  the  risks  of  which  cannot  be  assessed  or  disclosed. 

2.  Privacy and Confidentiality

In general, one of the most effective ways of protecting
volunteers from the unexpected, unwelcome or unauthorized use
of  information  about  them  is  to  ensure  that  there  are  no  op- 
portunities for  linking  an  individual  donor  with  information 
about  him/her  that  is  revealed  by  the  research.  By  not  col- 
lecting information  about  the  identity  of  a  research  subject 
and  any  biological  material  or  records  developed  in  the 
course  of  the  research,  or  by  subsequently  removing  all 
identifiers ("anonymizing" the sample), the possibility of risk
to  the  subject  stemming  from  the  results  of  the  research  is 
greatly  reduced.  Large-scale  DNA  sequence  determination 
represents  an  exception  because  each  person's  DNA  sequence 
is  unique  and,  ultimately,  there  is  enough  information  in  any 
individual's  DNA  sequence  to  absolutely  identify  her/him. 
However,  the  technology  that  would  allow  the  unambiguous 
identification  of  an  individual  from  his/her  DNA  sequence  is 
not  yet  mature.  Thus,  for  the  foreseeable  future,  establishing 
effective  confidentiality,  rather  than  relying  on  anonymity, 
will  be  a  very  useful  approach  to  protecting  donors. 

Investigators  should  introduce  as  many  disconnects  between 
the  identity  of  donors  and  the  publicly  available  information 
and  materials  as  possible.  There  should  not  be  any  way  for  any- 
one to  establish  that  a  specific  DNA  sequence  came  from  a  par- 
ticular individual,  other  than  resampling  an  individual's  DNA 
and  comparing  it  to  the  sequence  information  in  the  public  data- 
base. In  particular,  no  phenotypic  or  demographic  information 
about donors should be linked to the DNA to be sequenced.²
For  the  purposes  of  the  HGP  such  information  will  rarely  be 
useful,  and  recording  such  information  could  result  in  possible 
misuse  and  compromise  donor  confidentiality. 

Confidentiality  should  be  "two  way."  Not  only  should  others 
be  unable  to  link  a  DNA  sequence  to  a  particular  individual, 
but  no  individual  who  donates  DNA  should  be  able  to  confirm 
directly that a particular DNA sequence was obtained from
their DNA sample.³ This degree of confidentiality will
preclude the possibility of re-contacting DNA donors, providing
another  degree  of  protection  for  them.  It  should  be  clear  to 
both  investigators  and  to  donors  that  the  contact  involved  in 
obtaining the initial specimen will be the only contact.⁴

Another  approach  for  protecting  all  DNA  donors  is  to  reduce 
the  incentive  for  wanting  to  know  the  identities  of  particular 
donors. If the initial human sequence is a "mosaic" or
"patchwork" of sequenced regions derived from a number of
different individuals, rather than that of a single individual, there
would  be  considerably  less  interest  in  who  the  specific  donors 
were.  Although  there  may  be  scientific  justification  that  each 
clone library used for sequencing should be derived from one
person,  there  is  no  scientific  reason  that  the  entire  initial  hu- 
man DNA  sequence  should  be  that  of  a  single  individual.  As 
approximately  99.9%  of  the  human  DNA  sequence  is  common 
between  any  two  individuals,  most  of  the  fundamental  bio- 
logical information  contained  in  the  human  DNA  sequence  is 
common  to  all  people. 

To  increase  the  likelihood  that  the  first  human  DNA  sequence 
will  be  an  amalgam  of  regions  sequenced  from  different 
sources,  a  number  of  clone  libraries  must  be  made  available. 
Although  a  number  of  large  insert  libraries  have  been  made, 
most do not meet all of the standards set in this document;
therefore,  these  libraries  should  be  used  as  substrates  for 
large-scale  sequencing  only  under  circumscribed  conditions 
(see  section  6,  p.  79).  Starting  immediately,  new  libraries 
will  be  developed  that  have  the  advantage  of  being  con- 
structed in  accordance  with  the  ethical  principles  discussed 
in  this  document;  they  may  also  confer  some  additional  sci- 
entific benefit.  Such  libraries  are  critical  for  the  long-range 
needs  of  the  HGP. 

3.  Source/Recruitment  of  DNA  Donors 
for  Library  Construction 

Another  implication  of  the  fact  that  99.9%  of  the  human 
DNA  sequence  is  shared  by  any  two  individuals  is  that  the 
backgrounds  of  the  individuals  who  donate  DNA  for  the  first 
human  sequence  will  make  no  scientific  difference  in  terms 
of  the  usefulness  and  applicability  of  the  information  that 
results  from  sequencing  the  human  genome.  At  the  same 
time,  there  will  undoubtedly  be  some  sensitivity  about  the 
choice  of  DNA  sources.  There  are  no  scientific  reasons  why 
DNA  donors  should  not  be  selected  from  diverse  pools  of 
potential donors.⁵

There  are  two  additional  issues  that  have  arisen  in  consider- 
ing donor  selection.  These  warrant  particular  discussion: 

•  It  is  recognized  that  women  have  historically  been 
underrepresented  in  research,  so  it  can  be  anticipated 
that  concerns  might  arise  if  males  (sperm  DNA)  were 
used  exclusively  as  the  source  of  DNA  for  large-scale 
sequencing.  Although  there  would  be  no  scientific  basis 
for  concern,  because  even  in  the  case  of  a  male  source, 
half  of  the  donor's  DNA  would  have  come  from  his 
mother  and  half  from  his  father,  nevertheless  perceptions 
are  not  to  be  dismissed.  While  the  choice  of  donors  will 
not  be  dictated  to  investigators,  it  is  expected  that,  be- 
cause multiple  libraries  will  be  produced,  a  number  of 
them  will  be  made  from  female  sources  while  others  will 
be  made  from  male  sources. 

• Staff of laboratories involved in library construction and
DNA  sequencing  may  be  eager  to  volunteer  to  be  donors 
because  of  their  interest  and  belief  in  the  HGP.  However, 
proximity  to  the  research  may  create  some  special  vul- 
nerabilities for  laboratory  staff  members.  It  is  also  pos- 
sible that  they  will  feel  pressure  to  donate  and  there  may 
be  an  increased  likelihood  that  confidentiality  would  be 
breached.  Finally,  there  is  a  potential  that  the  choice  of 
persons so closely involved in the research may be interpreted
as elitist. For all of these reasons, it is recommended
that donors should not be recruited from laboratory
staff, including the principal investigator.

4. Informed Consent

Obtaining  informed  consent  specifically  for  the  purpose  of 
donating  DNA  for  large-scale  sequencing  raises  some  unique 
concerns.  Because  anonymity  cannot  be  guaranteed  and  con- 
fidentiality protections  are  not  absolute,  the  disclosure  pro- 
cess to  potential  donors  must  clearly  specify  what  the  pro- 
cess of  DNA  donation  involves,  what  may  make  it  different 
from  other  types  of  research,  and  what  the  implications  are 
of  one's  DNA  sequence  information  being  a  public  scientific 
resource. 

Federal regulations (45 CFR 46 and 10 CFR 745) require the
disclosure  of  a  number  of  issues  in  any  informed  consent 
document.  They  include  such  issues  as  potential  benefits  of 
the  research,  potential  risks  to  the  donor,  control  and  owner- 
ship of  donated  material,  long-term  retention  of  donated  ma- 
terial for  future  use,  and  the  procedures  that  will  be  followed. 
In  addition,  there  are  several  other  disclosures  that  are  of 
special  importance  for  donors  of  DNA  for  large-scale  se- 
quencing. These  include: 

• the meaning of confidentiality and privacy of information
in the context of large-scale DNA sequencing, and
how these issues will be addressed;

•  the  lack  of  opportunity  for  the  donor  to  later  withdraw 
the  libraries  made  from  his/her  DNA  or  his/her  DNA 
sequence  information  from  public  use; 

•  the  absence  of  opportunity  for  information  of  clinical 
relevance  to  be  provided  to  the  donor  or  her/his  family; 

•  the  possibility  of  unforeseen  risks;  and 

•  the  possible  extension  of  risk  to  family  members  of  the 
donor  or  to  any  group  or  community  of  interest  (e.g., 
gender,  race,  ethnicity)  to  which  a  donor  might  belong. 

Many  academic  human  genetics  units  have  considerable  ex- 
perience in  dealing  with  research  subjects  and  obtaining  in- 
formed consent,  while  the  laboratories  that  are  likely  to  be 
involved  in  making  the  libraries  for  sequencing  have,  in  gen- 
eral, much  less  experience  of  this  type.  Therefore,  library 
makers  are  encouraged  to  establish  a  collaboration  with  one 
or  more  human  genetics  units,  with  the  latter  being  respon- 
sible for  recruiting  donors,  obtaining  informed  consent,  ob- 
taining the  necessary  biological  samples,  and  providing  a 
blinded sample to the library maker. Collaboration with tissue
banks may be considered as long as these banks are collecting
tissues in accordance with this guidance. The library
maker  should  have  no  contact  with  the  donor  and  no  oppor- 
tunity to  obtain  any  information  about  the  donor's  identity. 


5.  IRB  Approval 

Effective  immediately,  projects  to  construct  libraries  for 
large-scale  DNA  sequencing  must  obtain  Institutional  Re- 
view Board  (IRB)  approval  before  work  is  initiated.  IRBs 
should  carefully  consider  the  unique  aspects  of  large-scale 
sequencing  projects.  Some  of  the  informed  consent  provi- 
sions outlined  may  be  somewhat  at  odds  with  the  usual  and 
customary disclosures found in most protocols involving human
subjects and which IRBs usually consider. For example,
research  subjects  usually  are  given  the  opportunity  to  with- 
draw from  a  research  project  if  they  change  their  minds 
about  participating.  In  the  case  of  donors  for  large-scale  se- 
quencing, it  will  not  be  possible  to  withdraw  either  the  librar- 
ies made  from  their  DNA  or  the  DNA  sequence  information 
obtained  using  those  libraries  once  the  information  is  in  the 
public  domain.  By  the  time  a  significant  amount  of  DNA  se- 
quence data  has  been  collected,  the  libraries,  as  well  as  indi- 
vidual clones  from  them,  will  have  been  widely  distributed 
and  the  sequence  information  will  have  been  deposited  in 
and  distributed  from  public  databases.  In  addition,  there  will 
be  no  possibility  of  returning  information  of  clinical  rel- 
evance to  the  donor  or  his/her  family. 

6.  Use  of  Existing  Libraries  for 
Large-Scale  Sequencing 

Many  of  the  existing  libraries  (including  those  derived  from 
anonymous  donors)  were  not  made  in  complete  conformity 
with  the  principles  elaborated  above.  The  potential  risks  that 
may  result  from  their  use  will  be  minimized  by  the  rapid  in- 
troduction of  several  new  libraries  constructed  in  accordance 
with  this  guidance,  which  NCHGR  and  DOE  are  taking  steps 
to  initiate.  This  will  ensure  that  the  existing  libraries  will 
only contribute small amounts to the first complete human
DNA  sequence.  In  the  interim,  existing  libraries  can  continue 
to  be  used  for  large-scale  sequencing,  only  if  IRB  approval 
and consent for "continued use" are obtained⁶ and approval
by  the  funding  agency  is  granted. 

It is important that in obtaining consent for continued use of
existing  libraries,  no  coercion  of  the  DNA  donor  occur.  It  is 
therefore  recommended  that  consideration  be  given  to 
whether  it  is  appropriate  for  the  individual  who  previously 
recruited  the  donor  to  recontact  him/her  to  obtain  this  con- 
sent. In  some  cases  an  IRB  may  determine  that  the  recontact 
should be made by a third party to assure that the donors are
fully informed and allowed to choose freely whether their
DNA can continue to be used for this purpose.

Conclusion

This  document  is  intended  to  provide  guidance  to  investiga- 
tors and  IRBs  who  are  involved  in  large-scale  sequencing 
efforts.  It  is  designed  to  alert  them  to  special  ethical  con- 
cerns that  may  arise  in  such  projects.  In  particular,  it  pro- 
vides guidance  for  the  use  of  existing  and  the  construction 
of  new  DNA  libraries.  Adhering  to  this  guidance  will  ensure 
that  the  initial  version  of  the  complete  human  sequence  is 
derived  from  multiple,  diverse  donors;  that  donors  will  have 
the  opportunity  to  make  an  informed  decision  about 
whether  to  contribute  their  DNA  to  this  project;  and  that 
effective  steps  will  be  taken  by  investigators  to  ensure  the 
privacy  and  confidentiality  of  donors. 

Investigators  funded  by  NCHGR  and  DOE  to  develop  new 
libraries  for  large-scale  human  DNA  sequencing  will  be  re- 
quired to  have  their  plans  for  the  recruitment  of  DNA  do- 
nors, including  the  informed  consent  documents,  reviewed 
and  approved  by  the  funding  agency  before  donors  are  re- 
cruited. Investigators  involved  in  large-scale  human  se- 
quencing will  also  be  asked  to  observe  those  aspects  of  this 
guidance  that  pertain  to  them. 

Approved August 17, 1996, by:

Francis S. Collins, M.D., Ph.D., Director, National Center
for Human Genome Research, National Institutes
of Health
Aristides N. Patrinos, Ph.D., Associate Director, Office of
Health and Environmental Research, U.S. Department
of Energy


Footnotes 

1. Office of Protection from Research Risks, Protecting
Human  Research  Subjects:  Institutional  Review  Board 
Guidebook  (OPRR:  U.S.  Government  Printing  Office, 
1993). 

2. It is recognized that it will be trivially easy to determine
the sex of the donor of the library, by assaying for the
presence  or  absence  of  Y  chromosome  in  the  library. 

3.  There  are  a  number  of  approaches  to  preventing  a 
DNA  donor  from  knowing  that  his/her  DNA  was  actually 
sequenced  as  part  of  the  HGP.  For  example,  each  time  a 
clone  library  is  to  be  made,  an  appropriately  diverse  pool  of 
between  five  and  ten  volunteers  can  be  chosen  in  such  a 
way  that  none  of  them  knows  the  identity  of  anyone  else  in 
the  pool.  Samples  for  DNA  preparation  and  for  preparation 
of  a  cell  line  can  be  collected  from  all  of  the  volunteers 
(who have been told that their specimen may or may not
eventually be used for DNA sequencing) and one of those
samples  is  randomly  and  blindly  selected  as  the  source  actu- 
ally used  for  library  construction.  In  this  way,  not  only  will 
the  identity  of  the  individual  whose  DNA  is  chosen  not  be 
known  to  the  investigators,  but  that  individual  will  also  not 
be  sure  that  s/he  is  the  actual  source. 

4.  Although  recontacting  donors  should  not  be  possible, 
investigators  will  potentially  want  to  be  able  to  resample  a 
donor's  genome.  Thus,  at  the  time  the  initial  specimen  is  ob- 
tained, in  addition  to  making  a  clone  library  representing  the 
donor's  genome,  it  should  also  be  used  to  prepare  an  addi- 
tional aliquot  of  high  molecular  weight  DNA  for  storage  and 
a  permanent  cell  line.  Either  resource  could  then  be  used  as  a 
source  of  the  donor's  genome  in  case  additional  DNA  were 
needed  or  comparison  with  the  results  of  the  analysis  of  the 
cloned  DNA  were  desired. 

5.  There  has  been  discussion  in  the  scientific  community 
about the sex of DNA donors. A library prepared from a female
donor will contain DNA from the X chromosome in an
amount  equivalent  to  the  autosomes,  but  will  completely  lack 
Y  chromosomal  DNA.  Conversely,  a  library  prepared  from  a 
male  donor  will  contain  Y  DNA,  but  both  X  and  Y  DNA  will 
only  be  present  at  half  the  frequency  of  the  DNA  from  the 
other  chromosomes.  Scientifically,  then,  there  are  both  ad- 
vantages and  disadvantages  inherent  in  the  use  of  either  a 
male  or  a  female  donor.  The  question  of  the  sex  of  the  donor 
also  involves  the  question  of  the  use  of  somatic  or  germ  line 
DNA  to  make  libraries.  For  making  libraries,  useful  amounts 
of  germ  line  DNA  can  only  be  obtained  from  a  male  source 
(i.e.,  from  sperm);  it  is  not  possible  to  obtain  enough  ova 
from a female donor to isolate germ line DNA for this purpose.
Opinion is divided in the scientific community about
whether  germ  line  or  somatic  DNA  should  be  used  for 
large-scale  sequencing.  Somatic  DNA  is  known  to  be  rear- 
ranged, relative  to  germ  line  DNA,  in  certain  regions  (e.g., 
the immunoglobulin genes) and the possibility has been
raised  that  other  developmentally  based  rearrangements  may 
occur,  although  no  example  of  the  latter  has  been  offered. 
While  some  believe  that  the  sequence  product  should  not 
contain  any  rearrangements  of  this  sort,  others  consider  this 
potential  advantage  of  germ  line  DNA  to  be  relatively  minor 
in  comparison  to  the  need  to  have  the  X  chromosome  fully 
represented  in  sequencing  efforts  and  prefer  the  use  of  so- 
matic DNA. 

6.  Individuals  whose  DNA  was  used  for  library  construc- 
tion (with  the  exception  of  those  created  from  deceased  or 
anonymous individuals) should be fully informed about the
risks  and  benefits  described  above,  should  freely  choose 
whether  they  would  like  their  DNA  to  continue  to  be  used  for 
this  purpose,  and  their  decision  should  be  documented. 
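
The blind selection described in footnote 3 can be illustrated with a short sketch. This is only one possible reading of the footnote, not a procedure prescribed by the guidance: the function name, the sample codes, and the choice of Python's secrets module are all assumptions made for illustration. The sketch takes a pool of five to ten coded samples and picks one uniformly at random, so that neither the investigators nor the volunteers know whose sample was used.

```python
import secrets


def select_source(sample_codes):
    """Return one coded sample chosen uniformly at random.

    sample_codes: opaque labels with no donor names attached
    (hypothetical codes, used here only for illustration).
    """
    if not 5 <= len(sample_codes) <= 10:
        raise ValueError("the pool should hold between five and ten samples")
    if len(set(sample_codes)) != len(sample_codes):
        raise ValueError("sample codes must be unique")
    # secrets.choice draws from the operating system's random source.
    return secrets.choice(sample_codes)


# Seven coded samples; in this sketch the key linking codes to donors
# is assumed never to reach the library maker.
pool = ["S01", "S02", "S03", "S04", "S05", "S06", "S07"]
chosen = select_source(pool)
print("sample used for library construction:", chosen)
```

Because every volunteer is told only that a specimen may or may not be used, the selected code reveals nothing to the donors themselves.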

Executive Summary of Joint
NIH-DOE  Human  Subjects 
Guidance 

1. Those engaged in large-scale sequencing must be
sensitive  to  the  unique  features  of  this  type  of  research 
and  ensure  that  both  the  protections  normally  afforded 
research  subjects  and  the  special  issues  associated  with 
human  genomic  DNA  sequencing  are  thoroughly 
addressed. 

2.  For  the  foreseeable  future,  establishing  effective 
confidentiality,  rather  than  relying  on  anonymity,  will  be 
a  very  useful  approach  to  protecting  donors. 

3.  Investigators  should  introduce  as  many  disconnects 
between  the  identity  of  donors  and  the  publicly  available 
information  and  materials  as  possible. 

4.  No  phenotypic  or  demographic  information  about 
donors  should  be  linked  to  the  DNA  to  be  sequenced. 

5. There are no scientific reasons why DNA donors should
not  be  selected  from  diverse  pools  of  potential  donors. 

6.  While  the  choice  of  donors  will  not  be  dictated  to 
investigators,  it  is  expected  that,  because  multiple 
libraries  will  be  produced,  a  number  of  them  will  be 
made  from  female  sources  while  others  will  be  made 
from  male  sources. 


7.  It  is  recommended  that  donors  should  not  be  recruited 
from laboratory staff, including the principal investigator.

8.  The  disclosure  process  to  potential  donors  must  clearly 
specify  what  the  process  of  DNA  donation  involves, 
what  may  make  it  different  from  other  types  of  research, 
and  what  the  implications  are  of  one's  DNA  sequence 
information  being  a  public  scientific  resource. 

9.  Library  makers  are  encouraged  to  establish  a  collabora- 
tion with  one  or  more  human  genetics  units  [or  tissue 
banks]. 

10.  The  library  maker  should  have  no  contact  with  the  donor 
and  no  opportunity  to  obtain  any  information  about  the 
donor's  identity. 

11. Effective immediately, projects to construct libraries for
large-scale  DNA  sequencing  must  obtain  Institutional 
Review  Board  (IRB)  approval  before  work  is  initiated. 

12.  Existing  libraries  can  continue  to  be  used  for  large-scale 
sequencing,  only  if  IRB  approval  and  consent  for 
continued  use  are  obtained  and  approval  by  the  funding 
agency  is  granted. 

13.  It  is  important  that  in  obtaining  informed  consent  for 
continued  use  of  existing  libraries,  no  coercion  of  the 
DNA donor occur.

Appendix D
Human Genome Project and Genetics on the World Wide Web

August 1997


The World Wide Web offers the easiest path to information
about the Human Genome Project and related genetics topics.
Some useful sites to visit are included in the list below.

Human Genome Project

DOE Human Genome Program

http://www.er.doe.gov/production/ober/hugtop.html

Devoted to the DOE component of the U.S. Human Genome
Project and to the DOE Microbial Genome Program.
Links to many other sites.

Human Genome Project Information

http://www.ornl.gov/hgmis

Comprehensive site covering topics related to the U.S.
and worldwide Human Genome Projects. Useful for updating
scientists and providing educational material for
nonscientists, in support of DOE's commitment to public
education. Developed and maintained for DOE by the
Human Genome Management Information System
(HGMIS) at Oak Ridge National Laboratory.

NIH National Human Genome Research Institute

http://www.nhgri.nih.gov

Site of the NIH sector of the U.S. Human Genome
Project.

DOE Human Genome Program
Publications

* Human Genome News

http://www.ornl.gov/hgmis/publicat/publications.html

Quarterly newsletter reporting on the worldwide Human
Genome Project.

Biological Sciences Curriculum Study (BSCS) Teaching
Modules

Online versions in preparation; hard copies available
from 719/531-5550

• "Genes, Environment, and Human Behavior," tentative
title, in preparation

• "Mapping and Sequencing the Human Genome:
Science, Ethics, and Public Policy" (1992)

• "The Human Genome Project: Biology, Computers,
and Privacy" (1996)

• "The Puzzle of Inheritance: Genetics and the Methods
of Science" (1997)

*Print copy available from HGMIS (see p. 87 or inside front cover
for contact information).

*Primer on Molecular Genetics, 1992

http://www.ornl.gov/hgmis/publicat/publications.html#primer

Explains the science behind the genome research.

*To Know Ourselves, 1996

http://www.ornl.gov/hgmis/tko

Booklet  reviewing  DOE's  role,  history,  and  achieve- 
ments in  the  Human  Genome  Project  and  introducing 
the  science  and  other  aspects  of  the  project. 

Ethical, Legal, and Social Issues Related
to Genetics Research

HGMIS Gateways Web page

http://www.ornl.gov/hgmis/links.html

Choose "Ethical, Legal, and Social Issues."

Center for Bioethics, University of Pennsylvania

http://www.med.upenn.edu/~bioethic

Full-text articles about such ethical issues as human
cloning;  includes  a  primer  on  bioethics. 

Courts and Science On-Line Magazine (CASOLM)

http://www.ornl.gov/courts

Coverage  of  genetic  issues  affecting  the  courts. 

ELSI in Science

http://www.lbl.gov/Education/ELSI/ELSI.html

Teaching modules designed to stimulate discussion on
implications  of  scientific  research. 

Eubios Ethics Institute

http://www.biol.tsukuba.ac.jp/~macer/index.html

Site includes newsletter summarizing literature in
bioethics and biotechnology.

Genetic  Privacy  Act 

http://www.ornl.gov/hgmis/resource/elsi.html

Model  legislation  written  with  support  of  the  DOE  Hu- 
man Genome  Program. 

MCET - The Human Genome Project

http://phoenix.mcet.edu/humangenome/index.html

ELSI issues for high school students.

National Bioethics Advisory Committee

http://www.nih.gov/nbac/nbac.htm

The bioethics committee offers advice to the National
Science and Technology Council and others on bioethical
issues arising from research related to human biology
and behavior.

National Center for Genome Resources

http://www.ncgr.org

Comprehensive  Genetics  and  Public  Issues  page;  in- 
cludes congressional  bills  related  to  genetic  privacy. 

The Gene Letter

http://www.geneletter.org/genetalk.html

Bimonthly  newsletter  to  inform  consumers  and  profes- 
sionals about  advances  in  genetics  and  encourage  dis- 
cussion about  emerging  policy  dilemmas. 

Your Genes, Your Choices

http://www.nextwave.org/ehr/books/index.html

Booklet written in simple English, describing the Human
Genome Project; the science behind it; and how
ethical,  legal,  and  social  issues  raised  by  the  project  may 
affect  people's  everyday  lives. 

General  Genetics  and  Biotechnology 

Many  of  the  following  sites  contain  links  to  both  educational 
and  technical  material. 

HGMIS  Community  Education  and  Outreach  Gateways 
Web  Page 

http://www.ornl.gov/hgmis/links.html

Access Excellence

http://outcast.gene.com/ae/index.html

Extensive genetic and biotechnology resources for
teachers and nonscientists.

BIO Online (Biotechnology Industry Organization)

http://www.bio.com

Comprehensive  directory  of  biotechnology  sites  on  the 
Internet. 

Biospace 

http://www.biospace.com 

Biotech  industry  site;  profiles  biotech  companies  by 
region. 

BioTech 

http://biotech.chem.indiana.edu

An interactive educational resource and biotech reference
tool; includes a dictionary of 6000 life science
terms.

Biotechnology Information Center, USDA National
Agricultural Library

http://www.nal.usda.gov/bic

Comprehensive agricultural biotechnology resource;
includes a bibliography on patenting biotechnology
products and processes
(http://www.nal.usda.gov/bic/Biblios/patentag.htm).

Bugs  'N  Stuff 

http://www.ncgr.org/microbe

List of microbial genomes being sequenced, research
groups, genome sizes, and facts about selected organisms.
Links to related sites.

Careers  in  Genetics 

http://www.faseb.org/genetics/gsa/careers/bro-menu.htm

Online booklet from the Genetics Society of America,
including several profiles of geneticists. See also career
sections of sites specified above, such as Access Excellence.

Carolina Biological Supply Company

http://www.carosci.com/tips.htm

Teaching materials for all levels. Includes mini-lessons
on selected scientific topics, two online magazines,
What's New, software, catalogs, and publications.

Cell  &  Molecular  Biology  Online 

http://www.tiac.net/users/pmgannon

Links to electronic publications, current research, educational
and career resources, and more.

CERN Virtual Library, Genetics section, Biosciences
Division

http://www.ornl.gov/TechResources/Human_Genome/genetics.html

Includes an organism index linking to other pertinent
databases, information on the U.S. and international Human
Genome Projects, and links to research sites.

Classic Papers in Genetics

http://www.esp.org

Covers the early years, with introductory notes. See also
Access Excellence site above for genetics history.

Community of Science Web Server

http://cos.gdb.org/best.html

Links to Medline, U.S. Patent Citation Database, Commerce
Business Daily, The Federal Register, and other
resources. 

Database  of  Genome  Sizes 

http://www.cbs.dtu.dk/databases/DOGS/index.html

Lists  numerous  organisms  with  genome  sizes,  scientific 
and  common  names,  classifications,  and  references. 

Genetic  and  biological  resources  links 

http://www.er.doe.gov/production/ober/bioinfo_center.html

Genetics  Education  Center,  University  of  Kansas  Medical 
Center 

http://www.kumc.edu/instruction/medicine/genetics/ 
homepage.html 

Educational  information  on  human  genetics,  career  re- 
sources. 

Genetics  Glossary 

http://www.ornl.gov/hgmis/publicat/glossary.html

Glossary  of  terms  related  to  genetics. 

Genetics  Webliography 

http://www.dml.georgetown.edu/%7Edavidsol/len.html 

Extensive links for researchers and nonscientists from
Georgetown  University  Library. 

Genomics:  A  Global  Resource 

http://www.phrma.org/genomics/index.html 

Many  links.  Website  a  joint  project  of  the  Pharmaceuti- 
cal Research  and  Manufacturers  of  America  and  the 
American  Institute  of  Biological  Sciences;  includes 
Genomics  Today,  a  daily  update  on  the  latest  news  in  the 
field. 

Hispanic  Educational  Genome  Project 

http://vflylab.calstatela.edu/hgp

Designed to educate high school students and their families
about genetics and the Human Genome Project.
Links to other projects.

Howard  Hughes  Medical  Institute 

http://www.hhmi.org 

Home  page  of  major  U.S.  philanthropic  organization 
that  supports  research  in  genetics,  cell  biology,  immu- 
nology, structural  biology,  and  neuroscience.  Excellent 
introductory  information  on  these  topics. 


Library  of  Congress 

http://lcweb.loc.gov/homepage/lchp.html

Microbial Database

http://www.tigr.org/tdb/mdb/mdb.html

Lists  completed  and  in-progress  microbial  genomes, 
with  funding  sources. 

MIT  Biology  Hypertextbook 

http://esg-www.mit.edu:8001/esgbio/7001main.html

All  the  basics. 

Science  and  Mathematics  Resources 

http://www-sci.lib.uci.edu

More  than  2000  Web  references,  including  Frank 
Potter's  Science  Gems  and  Martindale's  Health  Science 
Guide.  For  teachers  at  all  levels. 

Virtual  Courses  on  the  Web 

http://lenti.med.umn.edu/~mwd/courses.html

Links  to  Web  tutorials  in  biology,  genetics,  and  more. 

Welch  Web 

http://www.welch.jhu.edu

Links  to  many  Internet  biomedical  resources,  dictionaries, 
encyclopedias,  government  sites,  libraries,  and  more,  from 
the  Johns  Hopkins  University  Welch  Library. 

Why  Files 

http://whyfiles.news.wisc.edu

Illustrated  explanations  of  the  science  behind  the  news. 

Images  on  the  Web 

Biochemistry  Online 

http://biochem.arach-net.com

Essays, courses, 3-D images of biomolecules, modeling,
software. 

Bugs  in  the  News! 

http://falcon.cc.ukans.edu/~jbrown/bugs.html

Microbiology  information  and  a  nice  collection  of  im- 
ages of  biological  molecules. 

Cells  Alive! 

http://www.cellsalive.com 

Images (some moving) of different types of cells.

Cn3D (See in 3-D)

http://www3.ncbi.nlm.nih.gov/Entrez/Structure/cn3d.html

3-D  molecular  structure  viewer  allowing  the  user  to  visual- 
ize and  rotate  structure  data  entries  from  Entrez.  Highly 
technical,  for  researchers. 

Cytogenetics  Gallery 

http://www.pathology.washington.edu:80/Cytogallery

Photos  (karyotypes)  of  normal  and  abnormal  chromo- 
somes. 

DNA  Learning  Center,  Cold  Spring  Harbor  Laboratory 

http://darwin.cshl.org/index.html

Animated  images  of  PCR  and  Southern  Blotting  tech- 
niques. 

Gene  Map  from  the  1996  Genome  Issue  of  Science 

http://www.ncbi.nlm.nih.gov/SCIENCE96

Click  on  particular  areas  of  chromosomes  and  find  genes. 

Images  of  Biological  Molecules 

http://www.cc.ukans.edu/~micro/picts.html

3-D  structures  of  proteins  and  nucleic  acids  obtained  from 
Brookhaven  National  Laboratory  Protein  Database  and 
others. 

Lawrence  Livermore  National  Laboratory  Chromosome  19 
Physical  Map 

http://www-bio.llnl.gov/bbrp/genome/genome.html 

Los  Alamos  National  Laboratory  Chromosome  16 
Physical  Map 

http://www-ls.lanl.gov/DBqueries/QueryPage.html


Science Magazine Genome Issue (10/96)

http://www.sciencemag.org/science/content/vol274/issue5287

Full text includes a "clickable" gene map.

Science News

http://www.sciencenews.org

Online  version  of  weekly  popular  science  magazine  with 
full  text  of  selected  articles. 


Journals  and  Magazines 

HGMIS  Journals  Gateways  Web  page 

http://www.ornl.gov/hgmis/links.html

Choose "Journals, Books, Periodicals."

Biochemistry and Molecular Biology Journals

http://biochem.arach-net.com/beasley/journals.html

Comprehensive list.

Nature, Nature Genetics, and Nature Biotechnology

http://www.nature.com

Abstracts of articles, full text of letters and editorials.

Science Magazine

http://www.sciencemag.org

Abstracts  and  some  full-text  articles. 


Medical Genetics

Blazing  a  Genetic  Trail 

http://www.hhmi.org/GeneticTrail 

Illustrated  booklet  from  the  Howard  Hughes  Medical 
Institute  on  hunting  for  disease  genes. 

Directory  of  National  Genetic  Voluntary  Organizations 
and  Related  Resources 

http://medhlp.netusa.net/agsg/agsgsup.htm

Support  groups  for  people  with  genetic  diseases  and 
their  families. 

GeneCards 

http://bioinformatics.weizmann.ac.il/cards 

A  database  of  more  than  6000  genes;  describes  their 
functions,  products,  and  biomedical  applications. 

Gene  Therapy 

http://www.mc.vanderbilt.edu/gcrc/gene/index.html 

Web  course  covering  the  basics,  with  links  to  other  sites. 

Inherited-Disease  Genes  Found  by  Positional  Cloning 

http://www.ncbi.nlm.nih.gov/Baxevani/CLONE/ 
index.html 

Links  to  OMIM. 

NIH  Office  of  Recombinant  DNA  Activities 

http://www.nih.gov/od/orda 

Includes  a  database  of  human  gene  therapy  protocols. 

Online  Mendelian  Inheritance  in  Man  (OMIM) 

http://www.ncbi.nlm.nih.gov/Omim 

A  comprehensive,  authoritative,  and  up-to-date  human 
gene  and  genetic  disorder  catalog  that  supports  medical 
genetics and the Human Genome Project.

Promoting Safe and Effective Genetic Testing in the
United States (1997)

http://www.med.jhu.edu/tfgtelsi

Principles  and  recommendations  by  a  joint  NIH-DOE 
Human  Genome  Project  group  that  examined  the  devel- 
opment and  provision  of  gene  tests  in  the  United  States. 

Understanding  Gene  Testing 

http://www.gene.com/ae/AE/AEPC/NIH/index.html 

Illustrated  brochure  from  the  National  Cancer  Institute. 

Science  in  the  News 

EurekAlert!  http://www.eurekalert.org 
InScight: http://www.apnet.com/inscight
SciWeb:  http://www.sciweb.com/news.html 

Short  summaries  of  major  stories,  some  with  links  to 
related  articles  in  other  sources. 

HMS  Beagle 

http://biomednet.com/hmsbeagle 

Biweekly  electronic  journal  featuring  major  science 
stories, profiles, book reviews, and other items of interest.

Science  Daily 

http://www.sciencedaily.com 

Headline  stories,  articles,  and  links  to  news  services, 
newspapers,  magazines,  broadcast  sources,  journals,  and 
organizations.  Also  offers  weekly  bulletins  for  updates 
by  e-mail. 


Science  Guide 

http://www.scienceguide.com

Daily  news  and  information  service  and  free  science 
news e-mailer. Also contains directories of newsgroups,
grant  and  funding  resources,  employment,  and  online 
journals. 

ScienceNow 

http://www.sciencenow.org

Daily  online  news  service  from  Science  magazine  offers 
articles  on  major  science  news. 

Web  Search  Tools 

Biosciences  Index  to  WWW  Virtual  Library 

http://golgi.harvard.edu/htbin/biopages 

Metacrawler 

http://www.metacrawler.com 

"Search  the  Net" 

http://metro.turnpike.net/adom/search.html

Comprehensive  list  of  search  tools,  libraries,  world  fact 
books,  and  other  useful  information. 

Search.com 

http://www.search.com 

Yahoo! 

http://www.yahoo.com 

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Appendix  E 
1996  Human  Genome  Research  Projects 


Research abstracts of these projects appear in Part 2 of this report.


Sequencing 

Advanced  Detectors  for  Mass  Spectrometry 
W.H.  Benner  and  J.M.  Jaklevic 

Lawrence  Berkeley  National  Laboratory,  Berkeley,  California 

Mass Spectrometer for Human Genome
Sequencing
Chung-Hsuan Chen

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

Genomic  Sequence  Comparisons 
George  Church 

Harvard  Medical  School,  Boston,  Massachusetts 

A PAC/BAC End-Sequence Data Resource for
Sequencing the Human Genome: A 2-Year Pilot
Study
Pieter de Jong

Roswell Park Cancer Institute, Buffalo, New York

Multiple-Column  Capillary  Gel  Electrophoresis 
Norman  Dovichi 

University  of  Alberta,  Edmonton,  Canada 

DNA  Sequencing  with  Primer  Libraries 

John  J.  Dunn  and  F.  William  Studier 

Brookhaven  National  Laboratory,  Upton,  New  York 

Rapid Preparation of DNA for Automated
Sequencing
John J. Dunn and F. William Studier

Brookhaven National Laboratory, Upton, New York

A  PAC/BAC  End-Sequence  Database  for 
Human  Genomic  Sequencing 

Glen  A.  Evans 

University  of  Texas  Southwestern  Medical  Center,  Dallas,  Texas 

Automated DNA Sequencing by Parallel Primer
Walking

Glen  A.  Evans 

University  of  Texas  Southwestern  Medical  Center,  Dallas,  Texas 

*Parallel Triplex Formation as Possible
Approach for Suppression of DNA-Viruses
Reproduction
V.L.  Florentiev 

Russian  Academy  of  Sciences,  Moscow,  Russia 


Advanced  Automated  Sequencing  Technology: 
Fluorescent  Detection  for  Multiplex  DNA 
Sequencing 
Raymond  F.  Gesteland 

University  of  Utah,  Salt  Lake  City,  Utah 

Resource  for  Molecular  Cytogenetics 
Joe  Gray  and  Daniel  Pinkel 

University  of  California,  San  Francisco 

DNA  Sample  Manipulation  and  Automation 

Trevor  Hawkins 

Whitehead Institute and Massachusetts Institute of Technology,
Cambridge, Massachusetts

Construction of a Genome-Wide Characterized
Clone Resource for Genome Sequencing
Leroy Hood, Mark D. Adams,¹ and Melvin Simon²

University of Washington, Seattle

¹The Institute for Genomic Research, Rockville, Maryland

²California Institute of Technology, Pasadena, California

DNA  Sequencing  Using  Capillary  Electrophoresis 
Barry  L.  Karger 

Northeastern  University,  Boston,  Massachusetts 

Ultrasensitive  Fluorescence  Detection  of  DNA 
Richard  A.  Mathies  and  Alexander  N.  Glazer 

University  of  California,  Berkeley 

Joint  Human  Genome  Program  Between 
Argonne  National  Laboratory  and  the 
Engelhardt  Institute  of  Molecular  Biology 
Andrei  Mirzabekov 

Argonne  National  Laboratory,  Argonne,  Illinois,  and 
Engelhardt  Institute  of  Molecular  Biology,  Moscow,  Russia 

High-Throughput DNA Sequencing: SAmple
SEquencing (SASE) Analysis as a Framework
for Identifying Genes and Complete
Large-Scale Genomic Sequencing
Robert K. Moyzis

Los Alamos National Laboratory, Los Alamos, New Mexico

One-Step  PCR  Sequencing 
Barbara  Ramsay  Shaw 

Duke  University,  Durham,  North  Carolina 


*Projects designated by an asterisk were funded through small emergency
grants to Russian scientists following December 1992 site reviews by David
Galas (formerly of OHER, renamed OBER in 1997), Raymond Gesteland
(University of Utah), and Elbert Branscomb (LLNL).

Automation of the Front End of DNA Sequencing

Lloyd  M.  Smith  and  Richard  A.  Guilfoyle 

University  of  Wisconsin,  Madison 

High-Speed  DNA  Sequence  Analysis  by  Matrix- 
Assisted  Laser  Desorption  Mass  Spectrometry 

Lloyd  M.  Smith 

University  of  Wisconsin,  Madison 

Analysis of Oligonucleotide Mixtures by
Electrospray Ionization-Mass Spectrometry
Richard  D.  Smith 

Pacific  Northwest  National  Laboratory,  Richland,  Washington 

High-Speed Sequencing of Single DNA Molecules
in the Gas Phase by FTICR-MS
Richard  D.  Smith 

Pacific  Northwest  National  Laboratory,  Richland,  Washington 

Characterization  and  Modification  of  DNA 
Polymerases  for  Use  in  DNA  Sequencing 

Stanley  Tabor 

Harvard  University,  Boston,  Massachusetts 

Modular  Primers  for  DNA  Sequencing 
Levy Ulanovsky¹,²

¹Argonne National Laboratory, Argonne, Illinois
²Weizmann Institute of Science, Rehovot, Israel

Time-of-Flight  Mass  Spectroscopy  of  DNA  for 
Rapid  Sequence 
Peter  Williams 

Arizona  State  University,  Tempe,  Arizona 

Development  of  Instrumentation  for  DNA 
Sequencing  at  a  Rate  of  40  Million  Bases  Per  Day 
Edward  S.  Yeung 

Iowa  State  University,  Ames,  Iowa 

Mapping 

Resolving  Proteins  Bound  to  Individual  DNA 

Molecules 

David  Allison  and  Bruce  Warmack 

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

*Improved Cell Electrotransformation by
Macromolecules
Alexandre S. Boitsov

St.  Petersburg  State  Technical  University,  St.  Petersburg,  Russia 
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Overcoming  Genome  Mapping  Bottlenecks 
Charles  R.  Cantor 

Boston  University,  Boston,  Massachusetts 

Preparation  of  PAC  Libraries 
Pieter  J.  de  Jong 

Roswell  Park  Cancer  Institute,  Buffalo,  New  York 

Chromosomes  by  Third-Strand  Binding 
Jacques  R.  Fresco 

Princeton  University,  Princeton,  New  Jersey 

Chromosome  Region-Specific  Libraries  for 
Human  Genome  Analysis 
Fa-Ten  Kao 

Eleanor  Roosevelt  Institute  for  Cancer  Research,  Denver, 
Colorado 

*Identification and Mapping of DNA-Binding
Proteins Along Genomic DNA by DNA-Protein
Crosslinking
V.L.  Karpov 

Engelhardt  Institute  of  Molecular  Biology,  Russian  Academy 
of  Sciences,  Moscow,  Russia 

A  PAC/BAC  Data  Resource  for  Sequencing 
Complex  Regions  of  the  Human  Genome: 
A 2-Year Pilot Study
Julie  R.  Korenberg 

Cedars  Sinai  Medical  Center,  Los  Angeles,  California 

Mapping  and  Sequencing  of  the  Human 
X  Chromosome 

D.  L.  Nelson 

Baylor  College  of  Medicine,  Houston,  Texas 

*Sequence-Specific Proteins Binding to the
Repetitive Sequences of High Eukaryotic
Genome
Olga Podgornaya

Institute  of  Cytology,  Russian  Academy  of  Sciences, 
St.  Petersburg,  Russia 

*Protein-Binding DNA Sequences
O.L.  Polanovsky 

Engelhardt Institute of Molecular Biology, Russian Academy
of Sciences, Moscow, Russia

*Development of Intracellular Flow Karyotype
Analysis
A.I. Poletaev

Engelhardt  Institute  of  Molecular  Biology,  Russian  Academy 
of  Sciences,  Moscow,  Russia 

Mapping and Sequencing with BACs and
Fosmids
Melvin I. Simon

California  Institute  of  Technology,  Pasadena,  California 

Towards a Globally Integrated,
Sequence-Ready BAC Map of the Human
Genome
Melvin I. Simon

California Institute of Technology, Pasadena, California

Generation  of  Normalized  and  Subtracted 
cDNA  Libraries  to  Facilitate  Gene  Discovery 
Marcelo  Bento  Soares 

Columbia  University,  New  York,  New  York 

Mapping  in  Man-Mouse  Homology  Regions 

Lisa  Stubbs 

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

Positional  Cloning  of  Murine  Genes 

Lisa  Stubbs 

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

Human  Artificial  Episomal  Chromosomes 
(HAECS)  for  Building  Large  Genomic  Libraries 
Jean-Michel  H.  Vos 

University  of  North  Carolina,  Chapel  Hill 

*Cosmid and cDNA Map of a Human
Chromosome 13q14 Region Frequently Lost
at B Cell Chronic Lymphocytic Leukemia

N.K.  Yankovsky 

N.I.  Vavilov  Institute  of  General  Genetics,  Moscow,  Russia 

Informatics 

BCM  Server  Core 
Daniel  Davison 

Baylor  College  of  Medicine,  Houston,  Texas 


A  Freely  Sharable  Database-Management 
System  Designed  for  Use  in  Component-Based, 
Modular  Genome  Informatics  Systems 
Nathan  Goodman 

The  Jackson  Laboratory,  Bar  Harbor,  Maine 

A  Software  Environment  for  Large-Scale 

Sequencing 

Mark  Graves 

Baylor  College  of  Medicine,  Houston,  Texas 

Generalized  Hidden  Markov  Models  for 
Genomic  Sequence  Analysis 
David  Haussler 

University  of  California,  Santa  Cruz 

Identification,  Organization,  and  Analysis  of 
Mammalian  Repetitive  DNA  Information 

Jerzy  Jurka 

Genetic  Information  Research  Institute,  Palo  Alto,  California 

*TRRD,  GERD  and  COMPEL:  Databases  on 
Gene-Expression  Regulation  as  a  Tool  for 
Analysis  of  Functional  Genomic  Sequences 

N.A.  Kolchanov 

Institute  of  Cytology  and  Genetics,  Novosibirsk,  Russia 

Data-Management  Tools  for  Genomic  Databases 

Victor M. Markowitz and I-Min A. Chen

Lawrence  Berkeley  National  Laboratory,  Berkeley,  California 

The  Genome  Topographer:  System  Design 
T. Marr

Cold  Spring  Harbor  Laboratory,  Cold  Spring  Harbor, 
New  York 

A  Flexible  Sequence  Reconstructor  for 
Large-Scale  DNA  Sequencing:  A  Customizable 
Software  System  for  Fragment  Assembly 
Gene  Myers 

University  of  Arizona,  Tucson 

The  Role  of  Integrated  Software  and  Databases 
in  Genome  Sequence  Interpretation  and 
Metabolic  Reconstruction 
Ross  Overbeek 

Argonne National Laboratory, Argonne, Illinois

Database Transformations for Biological
Applications
G. Christian Overton, Susan B. Davidson, and
Peter Buneman

University of Pennsylvania, Philadelphia

Las Vegas Algorithm for Gene Recognition:
Suboptimal and Error-Tolerant Spliced
Alignment
Pavel A. Pevzner

University of Southern California, Los Angeles, California

Foundations  for  a  Syntactic  Pattern- 
Recognition  System  for  Genomic  DNA 
Sequences:  Languages,  Automata,  Interfaces, 
and  Macromolecules 
David  B.  Searls 

SmithKline  Beecham  Pharmaceuticals,  King  of  Prussia, 
Pennsylvania 

Analysis and Annotation of Nucleic Acid
Sequence
David J. States

Washington University, St. Louis, Missouri

Gene  Recognition,  Modeling,  and  Homology 
Search  in  GRAIL  and  genQuest 
Edward  C.  Uberbacher 

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

Informatics  Support  for  Mapping  in 
Mouse-Human Homology Regions
Edward  Uberbacher 

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

SubmitData:  Data  Submission  to  Public 
Genomic  Databases 
Manfred D. Zorn

Lawrence  Berkeley  National  Laboratory,  University  of 
California,  Berkeley 

ELSI 

The  Human  Genome:  Science  and  the  Social 
Consequences; Interactive Exhibits and Programs
on Genetics and the Human Genome
Charles  C.  Carlson 

The  Exploratorium,  San  Francisco,  California 


Documentary  Series  for  Public  Broadcasting 
Graham  Chedd  and  Noel  Schwerin 

Chedd-Angier  Production  Company,  Watertown, 
Massachusetts 

Human  Genome  Teacher  Networking  Project 
Debra  L.  Collins  and  R.  Neil  Schimke 

University  of  Kansas  Medical  Center,  Kansas  City,  Kansas 

Human  Genome  Education  Program 
Lane  Conn 

Stanford  Human  Genome  Center,  Palo  Alto,  California 

Your  World/Our  World-Biotechnology  &  You: 
Special  Issue  on  the  Human  Genome  Project 
Jeff  Davidson  and  Laurence  Weinberger 

Pennsylvania  Biotechnology  Association,  State  College, 
Pennsylvania 

The  Human  Genome  Project  and  Mental 
Retardation:  An  Educational  Program 

Sharon  Davis 

The  Arc  of  the  United  States,  Arlington,  Texas 

Pathways  to  Genetic  Screening:  Molecular 
Genetics  Meets  the  High-Risk  Family 
Troy  Duster 

University  of  California,  Berkeley 

Intellectual  Property  Issues  in  Genomics 

Rebecca  S.  Eisenberg 

University  of  Michigan  Law  School,  Ann  Arbor,  Michigan 

AAAS  Congressional  Fellowship  Program 
Stephen  Goodman 

The  American  Society  of  Human  Genetics,  Bethesda, 
Maryland 

A  Hispanic  Educational  Program  for  Scientific, 
Ethical,  Legal,  and  Social  Aspects  of  the  Human 
Genome  Project 
Margaret C. Jefferson and Mary Ann Sesma¹

California State University and ¹Los Angeles Unified School
District, Los Angeles, California

Implications of the Geneticization of Health
Care  for  Primary  Care  Practitioners 
Mary  B.  Mahowald 

University of Chicago, Chicago, Illinois
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Nontraditional  Inheritance:  Genetics  and  the 
Nature  of  Science;  Instructional  Materials  for 
High  School  Biology 
Joseph D. McInerney and B. Ellen Friedman

Biological  Sciences  Curriculum  Study,  Colorado  Springs, 
Colorado 

The  Human  Genome  Project:  Biology, 
Computers,  and  Privacy:  Development  of 
Educational  Materials  for  High  School  Biology 
Joseph D. McInerney and Lynda B. Micikas

Biological Sciences Curriculum Study, Colorado Springs,
Colorado

Involvement of High School Students in Sequencing
the Human Genome
Maureen M. Munn, Maynard V. Olson, and Leroy Hood

University  of  Washington,  Seattle 

The  Gene  Letter:  A  Newsletter  on  Ethical,  Legal, 
and  Social  Issues  in  Genetics  for  Interested 
Professionals  and  Consumers 
Philip J. Reilly, Dorothy C. Wertz, and Robin J.R. Blatt

The  Shriver  Center  for  Mental  Retardation,  Waltham, 
Massachusetts 

The  DNA  Files:  A  Nationally  Syndicated  Series 
of  Radio  Programs  on  the  Social  Implications  of 
Human  Genome  Research  and  Its  Applications 
Bari Scott

Genome  Radio  Project,  KPFA-FM,  Berkeley,  California 

Communicating Science in Plain Language:
The Science + Literacy for Health: Human
Genome Project
Maria Sosa, Judy Kass, and Tracy Gath

American Association for the Advancement of Science,
Washington, D.C.

The  Community  College  Initiative 

Sylvia  J.  Spengler  and  Laurel  Egenberger 

Lawrence  Berkeley  National  Laboratory,  Berkeley,  California 

Genome  Educators 

Sylvia  Spengler  and  Janice  Mann 

Lawrence  Berkeley  National  Laboratory,  Berkeley,  California 


Getting  the  Word  Out  on  the  Human  Genome 
Project:  A  Course  for  Physicians 

Sara L. Tobin and Ann Boughton¹

Stanford University, Palo Alto, California
¹Thumbnail Graphics, Oklahoma City, Oklahoma

The  Genetics  Adjudication  Resource  Project 
Franklin  M.  Zweig 

Einstein Institute for Science, Health, and the Courts,
Bethesda,  Maryland 

Infrastructure 

Alexander Hollaender Distinguished
Postdoctoral Fellowships
Linda Holmes and Eugene Spejewski

Oak Ridge Institute for Science and Education, Oak Ridge,
Tennessee

Human Genome Management Information
System
Betty K. Mansfield and John S. Wassom

Oak  Ridge  National  Laboratory,  Oak  Ridge,  Tennessee 

Human  Genome  Program  Coordination 

Sylvia  J.  Spengler 

Lawrence Berkeley National Laboratory, Berkeley, California

Support  of  Human  Genome  Program  Proposal 
Reviews 

Walter Williams

Oak  Ridge  Institute  for  Science  and  Education,  Oak  Ridge, 
Tennessee 

Former  Soviet  Union  Office  of  Health  and 
Environmental  Research  Program 
James  Wright 

Oak  Ridge  Institute  for  Science  and  Education,  Oak  Ridge, 
Tennessee 

SBIR 

1996  Phase  I 

An  Engineered  RNA/DNA  Polymerase  to 
Increase  Speed  and  Economy  of  DNA 
Sequencing 

Mark  W.  Knuth 

Promega  Corporation,  Madison,  Wisconsin 
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Directed  Multiple  DNA  Sequencing  and 
Expression  Analysis  by  Hybridization 
Gualberto Ruano

BIOS  Laboratories,  Inc.,  New  Haven,  Connecticut 

1996  Phase  II 

A  Graphical  Ad  Hoc  Query  Interface  Capable 
of  Accessing  Heterogeneous  Public  Genome 
Databases 
Joseph  Leone 

CyberConnect  Corporation,  Storrs,  Connecticut 


Low-Cost  Automated  Preparation  of  Plasmid, 
Cosmid,  and  Yeast  DNA 
William  P.  MacConnell 

MacConnell  Research  Corporation,  San  Diego,  California 

GRAIL-genQuest: A Comprehensive
Computational Framework for DNA Sequence
Analysis
Ruth Ann Manning

ApoCom, Inc., Oak Ridge, Tennessee
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Appendix  F:  DOE  BER  Program 


Text and photos in this appendix first appeared in a brochure
prepared by the Human Genome Management Information
System for the DOE Office of Biological and Environmental
Research to announce a symposium celebrating 50 years of
achievements in the Biological and Environmental Research
Program. "Serving Science and Society into the New
Millennium" was held on May 21-22, 1997, at the National
Academy of Sciences in Washington, D.C. The color
brochure and other recent publications related to BER
research, including the historically comprehensive A Vital
Legacy, may be obtained from HGMIS at the address on the
inside front cover.

Biological and Environmental Research Program

Aristides Patrinos, Ph.D.

Associate Director for Energy Research

for the

Office of Biological and Environmental Research

U.S. Department of Energy

301/903-3251, Fax: 301/903-5051


DOE  Biological  and 
Environmental  Research 
Program 

An  Extraordinary  Legacy 

To exploit the boundless promise of energy technologies and shed
light on their consequences to public health and the environment,
the Biological and Environmental Research program of the U.S.
Department of Energy's (DOE) Office of Health and
Environmental Research (OHER) has engaged in a variety of
multidisciplinary research activities.

treatments for human disease

•  Assessing  the  health  effects  of  radiation. 
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National  User  Facilities 

Dedicated biomedical resources, such as
those maintained by BER at several DOE
laboratories, are available at little or no
charge. These resources enable scientists
to gain an understanding of relationships
between biological structures and their
functions, study disease processes,
develop new pharmaceuticals, and
conduct basic research in molecular
biology and environmental processes.


William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) is a
national collaborative user facility for providing innovative approaches to meet
the needs of DOE's environmental missions.
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An  Enduring  Mandate 

DOE  is  carrying  forward  Congressional  mandates  that  began 
with  its  predecessors,  the  Atomic  Energy  Commission  and  the 
Energy  Research  and  Development  Agency: 

Contribute  to  a  Healthy  Citizenry 

•  Develop  innovative  technologies  for  tomorrow's 
biomedical  sciences. 

•  Provide  the  basis  for  individual  risk  assessments  by 
determining  the  human  genome's  fine  structure  by  the 
year  2005. 

•  Conduct  research  into  advanced  medical  technologies 
and  radiopharmaceuticals. 

•  Build  and  support  national  user  facilities  for 
determining  biological  structure,  and  ultimately 
function,  at  the  molecular  and  cellular  level. 


Understand  Global  Climate 
Change 

Predict  the  effects  of  energy  production  and  its  use  on  the 
regional  and  global  environment  by  acquiring  data  and 
developing  the  necessary  understanding  of  environmental 
processes. 


Contribute  to  Environmental 
Cleanup 

Conduct  fundamental  research  to  establish  a  better 
scientific  basis  for  remediating  contaminated  sites. 


DOE  user  facilities  are  revealing  the  molecular  details  of 
life.  Knowing  the  3-D  structure  of  the  ras  protein  (above), 
an  important  molecular  switch  governing  human  cell 
growth,  will  enable  interventions  to  shut  off  this  switch  in 
cancer  cells. 




Determining the fine structure (DNA sequence) of the
microorganism Methanococcus jannaschii (pictured at right,
top) and other minimal life forms in DOE's Microbial
Genome Program will benefit medicine, agriculture,
industrial and energy production, and environmental
bioremediation. The circular representation of the single
M. jannaschii chromosome, which was fully sequenced in
1996, illustrates the location of genes and other important
features. (Vertical bar represents a portion of a sequencing
experiment.)


Fifty Years of Achievements . . .

Leading  to  Innovative  Solutions 


Tools  for  Medicine  and  Research 

Radioisotopes  developed  for  medicine  and  medical  imaging  are 
being  merged  with  current  knowledge  in  biology  and  genetics  to 
discover  new  ways  of  diagnosing  and  treating  cancer  and  other 
disorders,  detecting  genes  in  action,  and  understanding  normal 
development  and  function  of  human  organ  systems. 

•  Radioactive  molecules  used  in  medical  imaging  for  positron 
emission  tomography  (PET)  and  magnetic  resonance  imaging 
(MRI)  allow  noninvasive  diagnosis,  monitoring,  and 
exploration  of  human  disorders  and  their  treatments. 

•  Isotopes  and  other  tracers  of 
brain  activity  are  being  used  to 
explore  drug  addiction,  the 
effects  of  smoking, 
Alzheimer's  disease, 
Parkinson's  disease,  and 
schizophrenia. 

•  Technetium-99m  is  used  to 
diagnose  diseases  of  the 
kidney,  liver,  heart,  brain,  and 
other  organs  in  about 
13  million  patients  per  year. 

•  Striking  successes  have  been 
achieved  using  charged  atomic 
particles  to  treat  thyroid  diseases, 
pituitary  tumors,  and  eye  cancer, 
among  other  disorders. 


Genome  Projects 

A  legacy  of  DOE  research  on  genetic 
effects  paved  the  way  for  the  world's 
first  Human  Genome  Program.  Now  new 
genomic  technologies  are  being  applied 
to  environmental  cleanup  through  the 
DOE  Natural  and  Accelerated 
Bioremediation  Research  and  Microbial 
Genome  programs,  healthcare  and  risk 
assessment,  and  such  other  national 
priorities  as  industrial  processes  and 
agriculture. 


One-quarter  of  all  patients  in  U.S. 
hospitals  undergo  tests  using  descendants 
of  cameras  developed  by  BER  to  follow 
radioactive tracers in the body. PET
scanning has been key to a generation of
brain metabolism studies as well as
diagnostic tests for heart disease and
cancer. PET studies above reveal brain
metabolism  differences  in  recovering 
alcoholics  (left,  10  days,  and  right, 
30  days,  after  withdrawal  from  alcohol). 


The  laser-based  flow 
cytometer  developed  at 
DOE  national 
laboratories enables
researchers  to  separate 
human  chromosomes 
for  analysis. 
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Discover  the  breadth  of  current  activities  and  recent  accomplishments  via  the  BER  Web  Site: 

http://www.er.doe.gov/production/ober/ober_top.html
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Radiation  Risks  and  Protection  Guidelines 

BER  studies  have  become  the  foundation  for  laws  and 
standards  that  protect  the  population,  including  workers 
exposed to radiological sources:

•  Guidelines  for  the  safe  use  of  diagnostic  X  rays  and 
radiopharmaceuticals. 

•  Safety  standards  for  the  presence  of  radionuclides  in 
food  and  drinking  water. 

•  Radiation-detection  systems  and  dosimetry 
techniques. 


Finding  a  Link  Between  DNA  Damage 
and  Cancers 

Studies  of  DNA  damage  have  uncovered  similar 
mechanisms  at  work  in  damage  caused  by  radiation 
exposure,  X  rays,  ultraviolet  light,  and  cancer-causing 
chemicals.  A  screening  test  for  such  chemicals  is  now 
one  of  the  first  hurdles  a  new  compound  must  clear  on 
its way to regulatory and public acceptance.


Tracking  the  Regional  and  Global 
Movement  of  Pollutants 

BER  research  helped  to  establish  the  earliest  and  most 
authoritative  monitoring  network  in  the  world  to 
detect  airborne  radioisotopes.  The  use  of  atmospheric 
tracers  has  led  to  the  improved  ability  to  predict  the 
dispersion  of  pollutants. 


Understanding  Global  Change 

Important  achievements  in  environmental  research 
have  led  to  enhanced  capabilities  in  studying  global 
change,  including  more  accurate  predictions  of 
global  and  regional  climate  changes  induced  by 
increasing  atmospheric  concentrations  of 
greenhouse  gases. 


Human  chromosomes  "painted"  by  fluorescent  dyes  to  detect 
abnormal  exchange  of  genetic  material  frequently  present  in 
cancer.  Chromosome  paints  also  serve  as  valuable  resources  for 
other  clinical  and  research  applications. 

". . . (it's) not so much where we stand
as in what direction we are moving."

[Oliver Wendell Holmes, Sr.]


High-performance
computing  is 
promoting 
faster  and 
more  realistic 
solutions  to 
long-term 
climate  change. 


The  Unmanned  Aerospace  Vehicle  (above)  conducts 
measurements to quantify the fate of solar radiation falling on
the  earth. 


Creating  a  New  Science  of  Ecology 

BER achievements in using radioactive tracers to follow
the movements of animals, routes of chemicals through
food chains, and decomposition of forest detritus, together
with the program's introduction of computer simulations,
created the new field of radioecology.
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Glossary 


This glossary was adapted from definitions in the DOE
Primer on Molecular Genetics (1992).


Adenine  (A):  A  nitrogenous  base,  one  member  of  the  base 
pair  A-T  (adenine-thymine). 

Allele:  Alternative  form  of  a  genetic  locus;  a  single  allele  for 
each locus is inherited separately from each parent (e.g., at a
locus  for  eye  color  the  allele  might  result  in  blue  or  brown 
eyes). 

Amino acid: Any of a class of 20 molecules that are combined
to form proteins in living things. The sequence of
amino acids in a protein and hence protein function are determined
by the genetic code.

Amplification: An increase in the number of copies of a specific
DNA fragment; can be in vivo or in vitro. See cloning,
polymerase chain reaction.

Arrayed library: Individual primary recombinant clones
(hosted in phage, cosmid, YAC, or other vector) that are
placed in two-dimensional arrays in microtiter dishes. Each
primary clone can be identified by the identity of the plate
and the clone location (row and column) on that plate. Arrayed
libraries of clones can be used for many applications,
including screening for a specific gene or genomic region of
interest as well as for physical mapping. Information gathered
on individual clones from various genetic linkage and
physical map analyses is entered into a relational database
and used to construct physical and genetic linkage maps
simultaneously; clone identifiers serve to interrelate the multilevel
maps. Compare library, genomic library.
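
The plate-plus-position addressing described for arrayed libraries can be sketched in a few lines of Python. This is only an illustration: the 96-well layout (8 rows by 12 columns), the identifier format, and the names CloneAddress, address_from_index, and clone_id are assumptions made for the example and do not come from the glossary.

```python
from typing import NamedTuple

# Assumed plate geometry: a 96-well microtiter dish (8 rows x 12 columns).
ROWS = "ABCDEFGH"
COLUMNS = 12


class CloneAddress(NamedTuple):
    """Location of one primary clone: which plate, and where on it."""
    plate: str
    row: str
    column: int


def address_from_index(plate: str, index: int) -> CloneAddress:
    """Convert a 0-based well index on a plate to a row/column address."""
    if not 0 <= index < len(ROWS) * COLUMNS:
        raise ValueError("well index outside the assumed 96-well plate")
    row, col = divmod(index, COLUMNS)
    return CloneAddress(plate, ROWS[row], col + 1)


def clone_id(addr: CloneAddress) -> str:
    """Build a hypothetical clone identifier such as 'P001-B02'."""
    return f"{addr.plate}-{addr.row}{addr.column:02d}"
```

An identifier of this kind is what would serve as the key in a relational database, so that results from different mapping experiments on the same clone can be joined.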

Autoradiography: A technique that uses X-ray film to visualize
radioactively labeled molecules or fragments of molecules;
used in analyzing length and number of DNA fragments
after they are separated by gel electrophoresis.

Autosome: A chromosome not involved in sex determination.
The diploid human genome consists of 46 chromosomes:
22 pairs of autosomes and 1 pair of sex chromosomes
(the X and Y chromosomes).


B 


BAC:  See  bacterial  artificial  chromosome. 

Bacterial artificial chromosome (BAC): A vector used to
clone DNA fragments (100- to 300-kb insert size; average,
150 kb) in Escherichia coli cells. Based on naturally occurring
F-factor plasmid found in the bacterium E. coli. Compare
cloning vector.


Bacteriophage:  See  phage. 

Base pair (bp): Two nitrogenous bases (adenine and thymine
or guanine and cytosine) held together by weak bonds.
Two strands of DNA are held together in the shape of a
double helix by the bonds between base pairs.

Base  sequence:  The  order  of  nucleotide  bases  in  a  DNA 
molecule. 

Base  sequence  analysis:  A  method,  sometimes  automated, 
for  determining  the  base  sequence. 

Biotechnology: A set of biological techniques developed
through basic research and now applied to research and product
development. In particular, the use by industry of recombinant
DNA, cell fusion, and new bioprocessing techniques.

bp:  See  base  pair. 


cDNA:  See  complementary  DNA. 

Centimorgan (cM): A unit of measure of recombination frequency.
One centimorgan is equal to a 1% chance that a
marker at one genetic locus will be separated from a marker
at a second locus due to crossing over in a single generation.
In human beings, 1 centimorgan is equivalent, on average, to
1 million base pairs.
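
The two rules of thumb in the centimorgan entry (1 cM is about a 1% chance of separation per generation, and about 1 million base pairs in humans) can be written as a small Python sketch. The function names are hypothetical, and the linear 1%-per-cM rule is assumed to hold only for short genetic distances.

```python
# Human genome-wide average quoted in the glossary entry: 1 cM ~ 1 million bp.
BASE_PAIRS_PER_CM = 1_000_000


def cm_to_base_pairs(cm: float) -> float:
    """Rough physical distance, in base pairs, for a genetic distance in cM."""
    return cm * BASE_PAIRS_PER_CM


def approx_recombination_chance(cm: float) -> float:
    """Approximate chance that two markers are separated in one generation.

    Uses the 1 cM ~ 1% rule from the definition; this linear approximation
    is assumed here to be reasonable only for small distances.
    """
    return cm / 100.0
```

For example, two markers 5 cM apart would lie roughly 5 million base pairs apart on average and would be separated in about 5% of meioses under these assumptions.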

Centromere: A specialized chromosome region to which
spindle fibers attach during cell division.

Chromosome: The self-replicating genetic structure of cells
containing the cellular DNA that bears in its nucleotide sequence
the linear array of genes. In prokaryotes, chromosomal
DNA is circular, and the entire genome is carried on
one chromosome. Eukaryotic genomes consist of a number
of chromosomes whose DNA is associated with different
kinds of proteins.

Clone  bank:  See  genomic  library. 

Clone:  A  group  of  cells  derived  from  a  single  ancestor. 

Cloning: The process of asexually producing a group of
cells (clones), all genetically identical, from a single ancestor.
In recombinant DNA technology, the use of DNA manipulation
procedures to produce multiple copies of a single
gene or segment of DNA is referred to as cloning DNA.

Cloning vector: DNA molecule originating from a virus, a
plasmid, or the cell of a higher organism into which another
DNA fragment of appropriate size can be integrated without
loss of the vector's capacity for self-replication; vectors introduce
foreign DNA into host cells, where it can be reproduced
in large quantities. Examples are plasmids, cosmids, and
yeast artificial chromosomes; vectors are often recombinant
molecules containing DNA sequences from several sources.

cM:  See  centimorgan. 

Code:  See  genetic  code. 

Codon:  See  genetic  code. 

Complementary  DNA  (cDNA):  DNA  that  is  synthesized 
from  a  messenger  RNA  template;  the  single-stranded  form  is 
often  used  as  a  probe  in  physical  mapping. 

Complementary  sequence:  Nucleic  acid  base  sequence  that 
can  form  a  double-stranded  structure  by  matching  base  pairs 
with  another  sequence;  the  complementary  sequence  to 
G-T-A-C is C-A-T-G.
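
The base-pairing rule behind this entry can be sketched in a few lines of Python. This is a minimal illustration only: the names PAIRS and complement are assumptions made for the example, and the function pairs bases position by position in the hyphenated notation used here, without reversing the strand.

```python
# Watson-Crick pairing in DNA: A pairs with T, and G pairs with C.
# PAIRS and complement are hypothetical names chosen for this illustration.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(sequence: str) -> str:
    """Return the position-by-position complement of a DNA sequence.

    Accepts either 'GTAC' or the hyphenated form 'G-T-A-C' and returns
    the hyphenated form. Raises KeyError for any character other than
    A, T, G, or C.
    """
    bases = sequence.upper().replace("-", "")
    return "-".join(PAIRS[base] for base in bases)


if __name__ == "__main__":
    print(complement("G-T-A-C"))
```

Applying the function twice returns the starting sequence, which reflects the point made under DNA that each strand can be deduced from its partner.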

Conserved  sequence:  A  base  sequence  in  a  DNA  molecule 
(or  an  amino  acid  sequence  in  a  protein)  that  has  remained 
essentially  unchanged  throughout  evolution. 

Contig:  Group  of  clones  representing  overlapping  regions  of 
a  genome. 

Contig  map:  A  map  depicting  the  relative  order  of  a  linked 
library  of  small  overlapping  clones  representing  a  complete 
chromosomal segment.

Cosmid:  Artificially  constructed  cloning  vector  containing 
the  cos  gene  of  phage  lambda.  Cosmids  can  be  packaged  in 
lambda phage particles for infection into E. coli; this permits
cloning  of  larger  DNA  fragments  (up  to  45  kb)  than  can  be 
introduced  into  bacterial  hosts  in  plasmid  vectors. 

Crossing over: The breaking during meiosis of one maternal
and one paternal chromosome, the exchange of corresponding
sections of DNA, and the rejoining of the chromosomes.
This process can result in an exchange of alleles between
chromosomes. Compare recombination.

Cytosine  (C):  A  nitrogenous  base,  one  member  of  the  base 
pair  G-C  (guanine  and  cytosine). 


D 

Deoxyribonucleotide: See nucleotide.

Diploid: A full set of genetic material, consisting of paired
chromosomes, one chromosome from each parental set. Most
animal cells except the gametes have a diploid set of chromosomes.
The diploid human genome has 46 chromosomes.
Compare haploid.

DNA (deoxyribonucleic acid): The molecule that encodes
genetic information. DNA is a double-stranded molecule
held together by weak bonds between base pairs of nucleotides.
The four nucleotides in DNA contain the bases
adenine (A), guanine (G), cytosine (C), and thymine (T). In
nature, base pairs form only between A and T and between G
and C; thus the base sequence of each single strand can be
deduced from that of its partner.

DNA  probe:  See  probe. 

DNA replication: The use of existing DNA as a template for
the synthesis of new DNA strands. In humans and other
eukaryotes, replication occurs in the cell nucleus.

DNA sequence: The relative order of base pairs, whether in
a fragment of DNA, a gene, a chromosome, or an entire genome.
See base sequence analysis.

Domain: A discrete portion of a protein with its own function.
The combination of domains in a single protein determines
its overall function.

Double helix: The shape that two linear strands of DNA assume
when bonded together.


E 


E. coli: Common bacterium that has been studied intensively
by  geneticists  because  of  its  small  genome  size,  normal  lack 
of  pathogenicity,  and  ease  of  growth  in  the  laboratory. 

Electrophoresis: A method of separating large molecules
(such as DNA fragments or proteins) from a mixture of similar
molecules. An electric current is passed through a medium
containing the mixture, and each kind of molecule travels
through the medium at a different rate, depending on its
electrical charge and size. Separation is based on these differences.
Agarose and acrylamide gels are the media commonly
used for electrophoresis of proteins and nucleic acids.

Endonuclease: An enzyme that cleaves its nucleic acid substrate
at internal sites in the nucleotide sequence.

Enzyme: A protein that acts as a catalyst, speeding the rate at
which a biochemical reaction proceeds but not altering the
direction or nature of the reaction.

EST: Expressed sequence tag. See sequence tagged site.

Eukaryote: Cell or organism with membrane-bound, structurally
discrete nucleus and other well-developed subcellular
compartments. Eukaryotes include all organisms except
viruses, bacteria, and blue-green algae. Compare prokaryote.
See chromosome.

Evolutionarily  conserved:  See  conserved  sequence. 

Exogenous  DNA:  DNA  originating  outside  an  organism. 

Exon: The protein-coding DNA sequence of a gene. Compare
intron.

Exonuclease: An enzyme that cleaves nucleotides sequentially
from free ends of a linear nucleic acid substrate.

Expressed  gene:  See  gene  expression. 


FISH (fluorescence in situ hybridization): A physical mapping
approach that uses fluorescein tags to detect hybridization
of probes with metaphase chromosomes and with the
less-condensed somatic interphase chromatin.

Flow cytometry: Analysis of biological material by detection
of the light-absorbing or fluorescing properties of cells
or subcellular fractions (i.e., chromosomes) passing in a narrow
stream through a laser beam. An absorbance or fluorescence
profile of the sample is produced. Automated sorting
devices, used to fractionate samples, sort successive droplets
of the analyzed stream into different fractions depending on
the fluorescence emitted by each droplet.

Flow  karyotyping:  Use  of  flow  cytometry  to  analyze  and 
separate  chromosomes  on  the  basis  of  their  DNA  content. 


Gamete:  Mature  male  or  female  reproductive  cell  (sperm  or 
ovum)  with  a  haploid  set  of  chromosomes  (23  for  humans). 

Gene: The fundamental physical and functional unit of heredity.
A gene is an ordered sequence of nucleotides located
in a particular position on a particular chromosome that encodes
a specific functional product (i.e., a protein or RNA
molecule). See gene expression.


Gene expression: The process by which a gene's coded information
is converted into the structures present and operating
in the cell. Expressed genes include those that are transcribed
into mRNA and then translated into protein and those
that are transcribed into RNA but not translated into protein
(e.g., transfer and ribosomal RNAs).

Gene family: Group of closely related genes that make similar
products.

Gene  library:  See  genomic  library. 

Gene mapping: Determination of the relative positions of
genes on a DNA molecule (chromosome or plasmid) and of
the distance, in linkage units or physical units, between them.

Gene product: The biochemical material, either RNA or
protein, resulting from expression of a gene. The amount of
gene product is used to measure how active a gene is; abnormal
amounts can be correlated with disease-causing alleles.

Genetic code: The sequence of nucleotides, coded in triplets
(codons) along the mRNA, that determines the sequence of
amino acids in protein synthesis. The DNA sequence of a
gene can be used to predict the mRNA sequence, and the genetic
code can in turn be used to predict the amino acid
sequence.
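
The two-step prediction described in this entry (DNA to mRNA, then mRNA to amino acids) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are illustrative, the input is taken to be the coding strand read from its first codon, and CODON_TABLE lists only a handful of the 64 codons of the standard genetic code.

```python
# Partial codon table from the standard genetic code; only a few of the
# 64 codons are listed, enough for the illustration below.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "AAA": "Lys", "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_dna: str) -> str:
    """Predict mRNA from the coding DNA strand: same sequence, U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA in triplets (codons) and look up each amino acid.

    Stops at the first stop codon; codons missing from the partial
    table are reported as '?'.
    """
    amino_acids = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[start:start + 3], "?")
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids


if __name__ == "__main__":
    message = transcribe("ATGTTTGGCTAA")
    print(message, translate(message))
```

A complete implementation would carry all 64 codons and would first locate the correct reading frame, which this sketch simply assumes.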

Genetic  engineering  technology:  See  recombinant  DNA 
technology. 

Genetic  map:  See  linkage  map. 

Genetic  material:  See  genome. 

Genetics: The study of the patterns of inheritance of specific
traits. 

Genome:  All  the  genetic  material  in  the  chromosomes  of  a 
particular  organism;  its  size  is  generally  given  as  its  total 
number  of  base  pairs. 

Genome project: Research and technology development
effort  aimed  at  mapping  and  sequencing  some  or  all  of  the 
genome  of  human  beings  and  other  organisms. 

Genomic library: A collection of clones made from a set of
randomly generated overlapping DNA fragments representing
the entire genome of an organism. Compare library, arrayed
library.

Guanine (G): A nitrogenous base, one member of the base
pair G-C (guanine and cytosine).

Haploid: A single set of chromosomes (half the full set of
genetic material), present in the egg and sperm cells of animals
and in the egg and pollen cells of plants. Human beings
have 23 chromosomes in their reproductive cells. Compare
diploid.

Heterozygosity:  The  presence  of  different  alleles  at  one  or 
more  loci  on  homologous  chromosomes. 

Homeobox: A short stretch of nucleotides whose base sequence
is virtually identical in all the genes that contain it. It
has been found in many organisms from fruit flies to human
beings. In the fruit fly, a homeobox appears to determine
when particular groups of genes are expressed during
development.

Homology:  Similarity  in  DNA  or  protein  sequences  between 
individuals  of  the  same  species  or  among  different  species. 

Homologous  chromosome:  Chromosome  containing  the 
same  linear  gene  sequences  as  another,  each  derived  from 
one  parent. 

Human gene therapy: Insertion of normal DNA directly
into cells to correct a genetic defect.

Human Genome Initiative: Collective name for several
projects begun in 1986 by DOE to (1) create an ordered set
of DNA segments from known chromosomal locations,
(2) develop new computational methods for analyzing genetic
map and DNA sequence data, and (3) develop new
techniques and instruments for detecting and analyzing
DNA. This DOE initiative is now known as the Human Genome
Program. The national effort, led by DOE and NIH, is
known as the Human Genome Project.

Hybridization:  The  process  of  joining  two  complementary 
strands  of  DNA  or  one  each  of  DNA  and  RNA  to  form  a 
double-stranded  molecule. 


Informatics: The study of the application of computer and
statistical techniques to the management of information. In
genome projects, informatics includes the development of
methods to search databases quickly, to analyze DNA sequence
information, and to predict protein sequence and
structure from DNA sequence data.

In situ hybridization: Use of a DNA or RNA probe to detect
the presence of the complementary DNA sequence in
cloned bacterial or cultured eukaryotic cells.

Interphase: The period in the cell cycle when DNA is replicated
in the nucleus; followed by mitosis.

Intron:  The  DNA  base  sequence  interrupting  the  protein- 
coding  sequence  of  a  gene;  this  sequence  is  transcribed  into 
RNA  but  is  cut  out  of  the  message  before  it  is  translated  into 
protein.  Compare  exon. 

In vitro: Outside a living organism.

Karyotype: A photomicrograph of an individual's chromosomes
arranged in a standard format showing the number,
size, and shape of each chromosome type; used in
low-resolution physical mapping to correlate gross chromosomal
abnormalities with the characteristics of specific
diseases.

kb: See kilobase.

Kilobase  (kb):  Unit  of  length  for  DNA  fragments  equal  to 
1000  nucleotides. 


Library: An unordered collection of clones (i.e., cloned
DNA from a particular organism), whose relationship to each
other can be established by physical mapping. Compare genomic
library, arrayed library.

Linkage: The proximity of two or more markers (e.g., genes,
RFLP markers) on a chromosome; the closer together the
markers are, the lower the probability that they will be separated
during DNA repair or replication processes (binary fission
in prokaryotes, mitosis or meiosis in eukaryotes), and
hence the greater the probability that they will be inherited
together.

Linkage  map:  A  map  of  the  relative  positions  of  genetic  loci 
on  a  chromosome,  determined  on  the  basis  of  how  often  the 
loci  are  inherited  together.  Distance  is  measured  in 
centimorgans  (cM). 

Localize: Determination of the original position (locus) of a
gene or other marker on a chromosome.

Locus (pl. loci): The position on a chromosome of a gene or
other chromosome marker; also, the DNA at that position.
The use of locus is sometimes restricted to mean regions of
DNA that are expressed. See gene expression.


M 

Macrorestriction map: Map depicting the order of and distance
between sites at which restriction enzymes cleave
chromosomes.

Mapping:  See  gene  mapping,  linkage  map,  physical  map. 

Marker:  An  identifiable  physical  location  on  a  chromosome 
(e.g.,  restriction  enzyme  cutting  site,  gene)  whose  inheritance 
can  be  monitored.  Markers  can  be  expressed  regions  of  DNA 
(genes)  or  some  segment  of  DNA  with  no  known  coding 
function  but  whose  pattern  of  inheritance  can  be  determined. 
See  RFLP,  restriction  fragment  length  polymorphism. 

Mb:  See  megabase. 

Megabase  (Mb):  Unit  of  length  for  DNA  fragments  equal  to 
1  million  nucleotides  and  roughly  equal  to  1  cM. 

Meiosis:  The  process  of  two  consecutive  cell  divisions  in  the 
diploid  progenitors  of  sex  cells.  Meiosis  results  in  four  rather 
than  two  daughter  cells,  each  with  a  haploid  set  of  chromo- 
somes. 

Messenger  RNA  (mRNA):  RNA  that  serves  as  a  template  for 
protein  synthesis.  See  genetic  code. 

Metaphase:  A  stage  in  mitosis  or  meiosis  during  which  the 
chromosomes  are  aligned  along  the  equatorial  plane  of  the  cell. 

Mitosis:  The  process  of  nuclear  division  in  cells  that  produces 
daughter  cells  that  are  genetically  identical  to  each  other  and 
to  the  parent  cell. 

mRNA:  See  messenger  RNA. 

Multifactorial or multigenic disorder: See polygenic
disorder.

Multiplexing: A sequencing approach that uses several pooled
samples simultaneously, greatly increasing sequencing speed.

Mutation:  Any  heritable  change  in  DNA  sequence.  Compare 
polymorphism. 


N 


Nitrogenous  base:  A  nitrogen-containing  molecule  having 
the  chemical  properties  of  a  base. 

Nucleic  acid:  A  large  molecule  composed  of  nucleotide  sub- 
units. 

Nucleotide: A subunit of DNA or RNA consisting of a nitrogenous
base (adenine, guanine, thymine, or cytosine in
DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate
molecule, and a sugar molecule (deoxyribose in DNA
and ribose in RNA). Thousands of nucleotides are linked to
form a DNA or RNA molecule. See DNA, base pair, RNA.

Nucleus:  The  cellular  organelle  in  eukaryotes  that  contains 
the  genetic  material. 


O


Oncogene: A gene, one or more forms of which is associated
with cancer. Many oncogenes are involved, directly or indirectly,
in controlling the rate of cell growth.

Overlapping  clones:  See  genomic  library. 


P1-derived artificial chromosome (PAC): A vector used to
clone DNA fragments (100- to 300-kb insert size; average,
150 kb) in Escherichia coli cells. Based on bacteriophage (a
virus) P1 genome. Compare cloning vector.

PAC: See P1-derived artificial chromosome.

PCR:  See  polymerase  chain  reaction. 

Phage:  A  virus  for  which  the  natural  host  is  a  bacterial  cell. 

Physical map: A map of the locations of identifiable landmarks
on DNA (e.g., restriction enzyme cutting sites, genes),
regardless of inheritance. Distance is measured in base pairs.
For the human genome, the lowest-resolution physical map
is the banding patterns on the 24 different chromosomes; the
highest-resolution map would be the complete nucleotide
sequence of the chromosomes.

Plasmid: Autonomously replicating, extrachromosomal circular
DNA molecules, distinct from the normal bacterial genome
and nonessential for cell survival under nonselective
conditions. Some plasmids are capable of integrating into the
host genome. A number of artificially constructed plasmids
are used as cloning vectors.

Polygenic disorder: Genetic disorder resulting from the
combined action of alleles of more than one gene (e.g., heart
disease, diabetes, and some cancers). Although such disorders
are inherited, they depend on the simultaneous presence
of several alleles; thus the hereditary patterns are usually
more complex than those of single-gene disorders. Compare
single-gene disorders.

Polymerase chain reaction (PCR): A method for amplifying
a DNA base sequence using a heat-stable polymerase and
two 20-base primers, one complementary to the (+)-strand at
one end of the sequence to be amplified and the other
complementary to the (-)-strand at the other end. Because the
newly synthesized DNA strands can subsequently serve as
additional templates for the same primer sequences, successive
rounds of primer annealing, strand elongation, and dissociation
produce rapid and highly specific amplification of
the desired sequence. PCR also can be used to detect the existence
of the defined sequence in a DNA sample.
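
The reason amplification is rapid can be shown with a small Python calculation. The sketch below is an idealized model, not a description of any laboratory protocol: it assumes every template is copied exactly once per cycle, and the function name ideal_copies is hypothetical.

```python
def ideal_copies(initial_templates: int, cycles: int) -> int:
    """Copies of the target sequence after a number of PCR cycles.

    Idealized assumption: each newly made strand serves as a template in
    the next round, so the number of copies doubles every cycle.
    """
    if initial_templates < 0 or cycles < 0:
        raise ValueError("inputs must be non-negative")
    return initial_templates * 2 ** cycles


if __name__ == "__main__":
    # Under the doubling assumption, one template yields 2**30 copies in 30 cycles.
    for n in (1, 10, 20, 30):
        print(n, ideal_copies(1, n))
```

Real reactions fall short of perfect doubling, so these figures should be read as upper bounds.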

Polymerase, DNA or RNA: Enzymes that catalyze the synthesis
of nucleic acids on preexisting nucleic acid templates,
assembling RNA from ribonucleotides or DNA from
deoxyribonucleotides.

Polymorphism: Difference in DNA sequence among individuals.
Genetic variations occurring in more than 1% of a
population would be considered useful polymorphisms for
genetic linkage analysis. Compare mutation.

Primer:  Short  preexisting  polynucleotide  chain  to  which  new 
deoxyribonucleotides  can  be  added  by  DNA  polymerase. 

Probe: Single-stranded DNA or RNA molecules of specific
base sequence, labeled either radioactively or immunologically,
that are used to detect the complementary base sequence
by hybridization.

Prokaryote: Cell or organism lacking a membrane-bound,
structurally discrete nucleus and other subcellular compartments.
Bacteria are prokaryotes. Compare eukaryote. See
chromosome.

Promoter:  A  site  on  DNA  to  which  RNA  polymerase  will 
bind  and  initiate  transcription. 


Protein: A large molecule composed of one or more chains
of amino acids in a specific order; the order is determined by
the base sequence of nucleotides in the gene coding for the
protein. Proteins are required for the structure, function, and
regulation of the body's cells, tissues, and organs, and each
protein has unique functions. Examples are hormones, enzymes,
and antibodies.

Purine: A nitrogen-containing, double-ring, basic compound
that occurs in nucleic acids. The purines in DNA and RNA
are adenine and guanine.

Pyrimidine: A nitrogen-containing, single-ring, basic compound
that occurs in nucleic acids. The pyrimidines in DNA
are cytosine and thymine; in RNA, cytosine and uracil.


R 

Rare-cutter  enzyme:  See  restriction  enzyme  cutting  site. 

Recombinant clone: Clone containing recombinant DNA
molecules. See recombinant DNA technology.

Recombinant DNA molecules: A combination of DNA molecules
of different origin that are joined using recombinant
DNA technologies.

Recombinant DNA technology: Procedure used to join together
DNA segments in a cell-free system (an environment
outside a cell or organism). Under appropriate conditions, a
recombinant DNA molecule can enter a cell and replicate
there, either autonomously or after it has become integrated
into a cellular chromosome.

Recombination:  The  process  by  which  progeny  derive  a 
combination  of  genes  different  from  that  of  either  parent.  In 
higher  organisms,  this  can  occur  by  crossing  over. 

Regulatory  region  or  sequence:  A  DNA  base  sequence  that 
controls  gene  expression. 

Resolution: Degree of molecular detail on a physical map of
DNA, ranging from low to high.

Restriction enzyme, endonuclease: A protein that recognizes
specific, short nucleotide sequences and cuts DNA at
those sites. Bacteria contain over 400 such enzymes that recognize
and cut over 100 different DNA sequences. See restriction
enzyme cutting site.

Restriction enzyme cutting site: A specific nucleotide sequence
of DNA at which a particular restriction enzyme cuts
the DNA. Some sites occur frequently in DNA (e.g., every
several hundred base pairs), others much less frequently
(rare-cutter; e.g., every 10,000 base pairs).

Restriction fragment length polymorphism (RFLP):
Variation between individuals in DNA fragment sizes cut by
specific restriction enzymes; polymorphic sequences that
result in RFLPs are used as markers on both physical maps
and genetic linkage maps. RFLPs are usually caused by mutation
at a cutting site. See marker.

RFLP:  See  restriction  fragment  length  polymorphism. 

Ribonucleic acid (RNA): A chemical found in the nucleus
and cytoplasm of cells; it plays an important role in protein
synthesis and other chemical activities of the cell. The structure
of RNA is similar to that of DNA. There are several
classes of RNA molecules, including messenger RNA, transfer
RNA, ribosomal RNA, and other small RNAs, each serving
a different purpose.

Ribonucleotide:  See  nucleotide. 


Sex chromosome: The X or Y chromosome in human beings
that determines the sex of an individual. Females have
two X chromosomes in diploid cells; males have an X and a
Y chromosome. The sex chromosomes comprise the 23rd
chromosome pair in a karyotype. Compare autosome.

Shotgun  method:  Cloning  of  DNA  fragments  randomly 
generated  from  a  genome.  See  library,  genomic  library. 

Single-gene disorder: Hereditary disorder caused by a mutant
allele of a single gene (e.g., Duchenne muscular dystrophy,
retinoblastoma, sickle cell disease). Compare polygenic
disorders.

Somatic cell: Any cell in the body except gametes and their
precursors.

Southern blotting: Transfer by absorption of DNA fragments
separated in electrophoretic gels to membrane filters
for detection of specific base sequences by radiolabeled
complementary probes.

STS:  See  sequence  tagged  site. 


Ribosomal RNA (rRNA): A class of RNA found in the ribosomes
of cells.

Ribosomes: Small cellular components composed of specialized
ribosomal RNA and protein; site of protein synthesis.
See ribonucleic acid (RNA).

RNA:  See  ribonucleic  acid. 


Sequence:  See  base  sequence. 

Sequence tagged site (STS): Short (200 to 500 base pairs)
DNA sequence that has a single occurrence in the human
genome and whose location and base sequence are known.
Detectable by polymerase chain reaction, STSs are useful for
localizing and orienting the mapping and sequence data reported
from many different laboratories and serve as landmarks
on the developing physical map of the human genome.
Expressed sequence tags (ESTs) are STSs derived
from cDNAs.

Sequencing:  Determination  of  the  order  of  nucleotides  (base 
sequences)  in  a  DNA  or  RNA  molecule  or  the  order  of  amino 
acids  in  a  protein. 


Tandem  repeat  sequences:  Multiple  copies  of  the  same 
base  sequence  on  a  chromosome;  used  as  a  marker  in 
physical  mapping. 

Technology  transfer:  The  process  of  converting  scientific 
findings  from  research  laboratories  into  useful  products  by 
the  commercial  sector. 

Telomere:  The  end  of  a  chromosome.  This  specialized 
structure  is  involved  in  the  replication  and  stability  of  linear 
DNA  molecules.  See  DNA  replication. 

Thymine  (T):  A  nitrogenous  base,  one  member  of  the  base 
pair  A-T  (adenine-thymine). 

Transcription: The synthesis of an RNA copy from a sequence
of DNA (a gene); the first step in gene expression.
Compare translation.

Transfer RNA (tRNA): A class of RNA having structures
with triplet nucleotide sequences that are complementary to
the triplet nucleotide coding sequences of mRNA. The role
of tRNAs in protein synthesis is to bond with amino acids
and transfer them to the ribosomes, where proteins are assembled
according to the genetic code carried by mRNA.

Transformation: A process by which the genetic material
carried by an individual cell is altered by incorporation of
exogenous DNA into its genome.

Translation: The process in which the genetic code carried
by mRNA directs the synthesis of proteins from amino acids.
Compare transcription.

tRNA: See transfer RNA.


U

Uracil: A nitrogenous base normally found in RNA but not
DNA; uracil is capable of forming a base pair with adenine.


V

Vector: See cloning vector.

Virus: A noncellular biological entity that can reproduce
only within a host cell. Viruses consist of nucleic acid covered
by protein; some animal viruses are also surrounded by
membrane. Inside the infected cell, the virus uses the synthetic
capability of the host to produce progeny virus.

VLSI: Very large scale integration allowing more than
100,000 transistors on a chip.


Y

YAC: See yeast artificial chromosome.

Yeast artificial chromosome (YAC): A vector used to clone
DNA fragments (up to 400 kb); it is constructed from the
telomeric, centromeric, and replication origin sequences
needed for replication in yeast cells. Compare cloning vector.


Preface


More than a decade ago, the Office of Health and Environmental Research (OHER) of the U.S. Department
of Energy (DOE) struck a bold course in launching its Human Genome Initiative, convinced that
its mission would be well served by a comprehensive picture of the human genome. Organizers recognized
that the information the project would generate, both technological and genetic, would contribute
not only to a new understanding of human biology and the effects of energy technologies but also to a host of
practical applications in the biotechnology industry and in the arenas of agriculture and environmental protection.

Today, the project's value appears beyond doubt as worldwide participation contributes toward the goals of determining
the human genome's complete sequence by 2005 and elucidating the genome structure of several model organisms as
well. This report summarizes the content and progress of the DOE Human Genome Program (HGP). Descriptive
research summaries, along with information on program history, goals, management, and current research highlights,
provide a comprehensive view of the DOE program.

Last year marked an early transition to the third and final phase of the U.S. Human Genome Project as pilot programs to
refine large-scale sequencing strategies and resources were funded by DOE and the National Institutes of Health, the two
sponsoring U.S. agencies. The human genome centers at Lawrence Berkeley National Laboratory, Lawrence Livermore
National Laboratory, and Los Alamos National Laboratory had been serving as the core of DOE multidisciplinary HGP
research, which requires extensive contributions from biologists, engineers, chemists, computer scientists, and mathematicians.
These team efforts were complemented by those at other DOE-supported laboratories and about 60 universities,
research organizations, companies, and foreign institutions. Now, to focus DOE's considerable resources on meeting the
challenges of large-scale sequencing, the sequencing efforts of the three genome centers have been integrated into the
Joint Genome Institute. The institute will continue to bring together research from other DOE-supported laboratories.
Work in other critical areas continues to develop the resources and technologies needed for production sequencing;
computational approaches to data management and interpretation (called informatics); and an exploration of the important
ethical, legal, and social issues arising from use of the generated data, particularly regarding the privacy and
confidentiality of genetic information.

Insights, technologies, and infrastructure emerging from the Human Genome Project are catalyzing a biological revolution.
Health-related biotechnology is already a success story, and is still far from reaching its potential. Other applications
are likely to beget similar successes in coming decades; among these are several of great importance to DOE.
We can look to improvements in waste control and an exciting era of environmental bioremediation, we will see new
approaches to improving energy efficiency, and we can hope for dramatic strides toward meeting the fuel demands of
the future.

In 1997 OHER, renamed the Office of Biological and Environmental Research (OBER), is celebrating 50 years of
conducting research to exploit the boundless promise of energy technologies while exploring their consequences to the
public's health and the environment. The DOE Human Genome Program and a related spin-off project, the Microbial
Genome Program, are major components of the Biological and Environmental Research Program of OBER.

DOE OBER is proud of its contributions to the Human Genome Project and welcomes general or scientific inquiries
concerning its genome programs. Announcements soliciting research applications appear in Federal Register, Science,
Human Genome News, and other publications. The deadline for formal applications is generally midsummer for awards
to be made the next year, and submission of preproposals in areas of potential interest is strongly encouraged.
Further information may be obtained by contacting the program office or visiting the DOE home page (301/903-6488,
Fax: -8521, genome@oer.doe.gov, URL: http://www.er.doe.gov/production/ober/hug_top.html).

Aristides A. Patrinos, Associate Director
Office  of  Biological  and  Environmental  Research 
U.S.  Department  of  Energy 
Novembers.  1997 
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he  research  abstracts  in  this  section  were  funded  in  FY  1996  by  the  DOE  Office  of  Health  and  Environ- 
mental Research,  which  was  renamed  Office  of  Biological  and  Environmental  Research  in  1997. 


These unedited abstracts were contributed by DOE Human Genome Program grantees and contractors.
Names of principal investigators are in bold print. Submitted in 1996, contact information is for the first person named
unless another investigator is designated as contact person. Principal investigators of research projects described by
abstracts in this section are listed under their respective subject categories, and an index of all investigators named in
the abstracts is given at the end of this report.

Part 1 of this report contains narratives that represent DOE Human Genome Program research in large,
multidisciplinary projects. As a convenience to the reader, these narratives are reprinted (without graphics) as an
appendix to this volume, Part 2. The projects represent work at the Joint Genome Institute (p. 72), Lawrence Livermore
National Laboratory Human Genome Center (p. 73), Los Alamos National Laboratory Center for Human Genome Studies
(p. 77), Lawrence Berkeley National Laboratory Human Genome Center (p. 81), University of Washington Genome Center
(p. 85), Genome Database (p. 87), and National Center for Genome Resources (p. 91). Only the contact persons for
these organizations are listed in the Index to Principal and Coinvestigators. More information on research carried out in
these projects can be found on their listed Web sites.
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tual  mass  data  could  be  determined.  To  address  this  prob- 
lem, we  are  developing  a  detector  that  will  simultaneously 
measure  the  charge  and  velocity  of  individual  ions.  We 
have been able to mass analyze DNA molecules in the 1 to
10 MDa range using charge-detection mass spectrometry.
In  this  technique,  individual  electrospray  ions  are  directed 
to  fly  through  a  metal  tube  which  detects  their  image 
charge.  Simultaneous  measurement  of  their  velocity  pro- 
vides a  way  to  measure  their  mass  when  ions  of  known 
energy  are  sampled.  Several  thousand  ions  can  be  ana- 
lyzed in  a  few  minutes,  thus  generating  statistically  sig- 
nificant mass  values  regarding  the  ions  in  a  sample  popu- 
lation. We  are  attempting  to  apply  this  technology  to  the 
analysis  of  PCR  products. 
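
The relationship described above, in which a measured charge and velocity
yield a mass for ions of known energy, can be sketched numerically. This is
a minimal illustration, not the authors' analysis code. It assumes the
"known energy" is set by accelerating each ion through a fixed potential, so
that (1/2) m v^2 = z e V; the function name and the example values (charge
count, velocity, potential) are hypothetical.

```python
# Charge-detection mass estimate, assuming each ion was accelerated
# through a known potential so that (1/2) m v^2 = z e V.
E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # one dalton, kilograms


def ion_mass_da(charge_count, velocity_m_s, accel_voltage_v):
    """Return ion mass in daltons from measured charge and velocity."""
    if velocity_m_s <= 0:
        raise ValueError("velocity must be positive")
    mass_kg = 2.0 * charge_count * E_CHARGE * accel_voltage_v / velocity_m_s ** 2
    return mass_kg / DALTON


# Hypothetical ion: 1000 charges, 4394 m/s, 300 V accelerating potential.
example_mass = ion_mass_da(1000, 4394.0, 300.0)
print(f"{example_mass / 1e6:.2f} MDa")
```

Because the mass scales with the inverse square of the velocity, errors in
the velocity measurement are doubled in relative terms in the mass, which
is one reason averaging over several thousand ions is useful.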

DOE  Contract  No.  DE-AC03-76SF00098. 

Mass  Spectrometer  for  Human 
Genome  Sequencing 

Chung-Hsuan Chen, Steve L. Allman, and K. Bruce Jacobson
Oak Ridge National Laboratory, Oak Ridge, TN 37831
423/574-5895, Fax: -2115, chenc@ornl.gov

The  objective  of  this  program  is  to  develop  an  innovative 
fast  DNA  sequencing  technology  for  the  Human  Genome 
Project. It can also be applied to fast screening of genetic
and contagious diseases, DNA fingerprinting, and
environmental impact analysis.

The approach of this program is to replace conventional
gel electrophoresis sequencing methods with lasers and
mass spectrometry. The present gel sequencing method
usually takes hours to days to complete a DNA analysis or
sequencing run, since DNA segments of different lengths
need to be separated in dense gel. With the laser
desorption mass spectrometry (LDMS) approach, DNA
segments of various sizes are separated in the vacuum
chamber of a mass spectrometer. Thus, the time taken to
separate various sizes of DNA is less than one second,
compared to hours using other methods.
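
The sub-second separation time follows from the physics of a
time-of-flight measurement. The sketch below is an illustration under
stated assumptions rather than a description of the Oak Ridge instrument:
the drift length (1 m), accelerating potential (20 kV), single charge, and
the rough average of 309 Da per nucleotide are all hypothetical values
chosen for the example.

```python
import math

# Time-of-flight estimate for singly charged DNA fragments.
# All instrument parameters below are assumed for illustration.
E_CHARGE = 1.602176634e-19    # elementary charge, coulombs
DALTON = 1.66053906660e-27    # one dalton, kilograms
NUCLEOTIDE_MASS_DA = 309.0    # rough average mass per nucleotide (assumed)


def flight_time_s(n_bases, drift_length_m=1.0, accel_voltage_v=20_000.0, charge=1):
    """Flight time t = L * sqrt(m / (2 z e V)) for an n-base fragment."""
    mass_kg = n_bases * NUCLEOTIDE_MASS_DA * DALTON
    return drift_length_m * math.sqrt(mass_kg / (2.0 * charge * E_CHARGE * accel_voltage_v))


for n in (20, 100, 400):
    print(f"{n:4d} bases: {flight_time_s(n) * 1e6:7.1f} microseconds")
```

Under these assumptions the flight times are on the order of tens to
hundreds of microseconds, and they grow only with the square root of
fragment length.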

Recently, we successfully demonstrated sequencing of short
DNA segments with this approach. We have also succeeded
in using LDMS for fast screening of cystic fibrosis,
identifying both a point mutation and a deletion associated
with the disease. In addition, we had preliminary success
in using LDMS to achieve DNA fingerprinting. Thus, LDMS
is going to emerge as a new and important biotechnological
tool for DNA analysis.

DOE  Contract  No.  DE-AC05-84OR21400. 

*Projects designated by an asterisk received small emergency grants following December 1992 site reviews by David Galas (formerly DOE Office of
Health and Environmental Research, which was renamed Office of Biological and Environmental Research in 1997), Raymond Gesteland (University
of Utah), and Elbert Branscomb (Lawrence Livermore National Laboratory).


Advanced  Detectors  for  Mass 
Spectrometry 

W.H.  Benner  and  J.M.  Jaklevic 

Human Genome Group; Engineering Science Department;
Lawrence Berkeley National Laboratory; University of
California; Berkeley, CA 94720
510/486-7194, Fax: -5857, whbenner@lbl.gov
http://www-hgc.lbl.gov

Mass spectrometry is an instrumental method capable of
producing rapid analyses with high mass accuracy. When
applied to genome research, it is an attractive alternative to
gel electrophoresis. At present, routine DNA analysis by
mass spectrometry is seriously constrained to small DNA
fragments. In contrast to other mass spectrometry
facilities, in which the development of ladder sequencing is
emphasized, we are exploring the application of mass
spectrometry to procedures that identify short sequences.
This approach helps the molecular biologists associated
with LBL's Human Genome Center to identify redundant
sequences and vector contamination in clones rapidly,
thereby improving sequencing efficiency. We are also
attempting to implement a rapid mass spectrometry-based
screening procedure for PCR products.

The implementation of these applications requires that the
performance of matrix-assisted-laser-desorption-ionization
(MALDI) and electrospray mass spectrometry be improved.
Our focus is the development of new ion detectors
which will advance the state of the art of each of these
two types of spectrometers. One of the limitations for
applying mass spectrometry to DNA analysis relates to the
poor efficiency with which conventional electron
multipliers detect large ions, a problem most apparent in
MALDI-TOF-MS. To solve this problem, we are developing
alternative detection schemes which rely on heat pulse
detection. The kinetic energy of impacting ions is
converted into heat when ions strike a detector, and we are
attempting to measure such heat pulses indirectly. We are
developing a type of cryogenic detector called a
superconducting tunnel junction device, which responds to
the phonons produced when ions strike the detector. This
detector does not rely on the formation of secondary
electrons. We have demonstrated this type of detector to be
at least two orders of magnitude more sensitive, on an
area-normalized basis, than microchannel plate ion
detectors. This development could extend the upper mass
limit of MALDI-TOF-MS and increase sensitivity.

Electrospray ion sources generate ions of mega-Dalton
DNA with minimal fragmentation, but the mass
spectrometric analyses of these large ions usually lead only to a
mass-to-charge  distribution.  If  ion  charge  was  known,  ac- 
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Genomic  Sequence  Comparisons 

George  Church 

Harvard  Medical  School;  Boston,  MA  02115 
617/432-0503  or  -7562,  Fax:  -7266 
http://arep.med.harvard.edu

The first objective of this project is completion of an
automated system to sequence DNA using electrophore
mass-tag (EMT) primers for dideoxy sequencing. The
prototype machine will contain a 60-capillary array with 400
EMT-labeled sequence ladders per capillary. The system is
designed to use 100-fold less reagent and have 500-fold
higher speed (1000 bases per sec per instrument) than
current sequencing technology. EMTs are cleaved and
laser-desorbed from membranes for subsequent detection by
ECTOF mass spectrometry. The second objective is to
overcome the limitations of purely hypothetical annotation
of the growing number of reading frames in new genome
sequences. We measure gene product levels and
interactions using DNA microarrays, whole genome in vivo
footprinting and crosslinking.

Our approach involves system integration of
instrumentation, organic chemistry, molecular biology,
electrophoresis, and software for the task of increasing
sequencing accuracy and efficiency. Likewise, we integrate
such instruments and others with the needs of acquiring and
annotating large-scale microbial and human genomic
sequence and population polymorphisms.

To  establish  functions  for  new  genes,  we  use  large  scale 
phenotyping  by  multiplexed  growth  competition  assays, 
both  by  targeted  deletion  and  by  saturation  insertional  mu- 
tagenesis. We  will  continue  to  develop  a  system  to  se- 
quence DNA  using  electrophore  mass-tags  (EMTs).  We 
will  establish  genome-scale  experimental  methods  for  se- 
quence annotation. 

The  most  significant  findings  in  1995-1996  were  1)  Dem- 
onstration of  use  of  electrophore  mass-tags  in  dideoxy  se- 
quencing. 2)  Development  of  IR-laser  desorption  method 
and  model.  3)  A  novel  dsDNA  microarray  synthesis  strat- 
egy. 4)  A  new  amplifiable  differential  display  for 
whole-genome  in  vivo  DNA-protein  interactions.  5)  Estab- 
lishment and  application  of  a  microbial  DNA-protein  inter- 
action database. 

DOE  Grant  No.  DE-FG02-87ER60565. 


A  PAC/BAC  End-Sequence  Data 
Resource  for  Sequencing  the  Human 
Genome: A 2-Year Pilot Study

Pieter  de  Jong 

Roswell  Park  Cancer  Institute;  Buffalo,  NY  14263 
716/845-3168, Fax: -8849, pieter@dejong.med.buffalo.edu
http://bacpac.med.buffalo.edu

Large-scale sequencing of the human genome requires the
availability of high-fidelity clones with large genomic
inserts and a mechanism to find clones with minimal
overlaps within the clone collections. The first need can be
satisfied with bacterial artificial chromosome libraries
(PACs and BACs), some of which already exist and more of
which are now being developed. However, a cost-effective
way of establishing high-resolution contig maps for the
human genome has not yet been established. Recently, a new
approach for virtual screening for overlapping clones has
been proposed by several research groups and has been
discussed eloquently in a manuscript by Venter et al., 1996
(Nature). We will implement this approach for use with
our human PAC and BAC libraries and use the first year as
a pilot stage. The goal of the one-year pilot is to prove the
feasibility of large-scale end sequencing and to
demonstrate usefulness.

The first goal will be met by sequencing the ends for
40,000 clones from our existing PAC library and from
BAC libraries currently being developed under NIH
funding within our laboratory. The end-sequencing will be
based on our new DOP-vector PCR procedure (Chen et al.,
1996, Nucleic Acids Research 24, 2614-2616). All
sequence data will be made available through public
databases (GSDB, GDB, Genbank) and will also become
BLAST searchable through the UTSW WWW site from
our collaborator, Glen Evans. In view of our current
under-developed informatics structure, we do not expect to
provide BLAST search access through our own web site
during the pilot phase.

To prove the usefulness of available end sequences, we
will prepare a chromosome 14-enriched clone collection
from our current 20-fold deep PAC library. To detect the
chromosome 14 clones, we will use as hybridization
probes a set of 1,000 mapped STS markers available from
Paul Dear (MRC, Cambridge, UK), the approximately 600
markers present in the Whitehead map, and the in situ
mapped BAC and PAC clones available from Julie Korenberg.
We will hybridize with these existing markers in probe
pools, specific for regions of chromosome 14. Thus we will
isolate region-enriched PAC clone collections.

Assuming  that  the  clone  collections  will  be  at  least 
50%-specific  for  chromosome  14  (50%  false  positives) 
and  will  include  most  of  the  chromosome  14  PACs  from 
our  library,  a  collection  of  about  35,000  clones  is  expected. 
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Hence,  the  bulk  of  the  end  sequences  obtained  during  the 
first  year  will  be  derived  from  the  chromosome  14  en- 
riched set  and  should  result  in  a  sequence  ready  clone  col- 
lection covering  about  100  Mbp  of  the  human  genome. 
The  purity  of  the  chromosome  14  PAC  collection  will  be 
characterized  in  a  number  of  different  ways,  including  test- 
ing with  independent  markers  not  used  as  probes  and  by 
FISH  analysis  of  a  representative  set  of  PAC  clones.  To 
test  the  usefulness  of  the  end  sequence  resource,  the 
Sanger  Centre  will  sequence  chromosome  14  PACs  from 
our  collection  and  identify  overlapping  clones  by  virtual 
screening,  using  our  end-sequence  database. 
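
The clone count estimated above (about 35,000 clones for a collection
covering about 100 Mbp) can be checked with simple library arithmetic.
The sketch below is one way such a figure could arise, not the authors'
own calculation. The 20-fold depth, 50% specificity, and 100 Mbp target
come from the abstract; the mean insert size of 115 kbp is an assumption
introduced here for illustration, since the abstract does not state it.

```python
# Expected size of a chromosome-enriched clone collection.
# mean_insert_bp is an assumed value; the abstract does not give it.

def expected_collection_size(region_bp, library_depth, mean_insert_bp, specificity):
    """Clones picked = true clones covering the region / fraction that are true."""
    true_clones = library_depth * region_bp / mean_insert_bp
    return true_clones / specificity


n_clones = expected_collection_size(
    region_bp=100e6,        # about 100 Mbp (from the abstract)
    library_depth=20,       # 20-fold deep PAC library (from the abstract)
    mean_insert_bp=115e3,   # hypothetical mean PAC insert size
    specificity=0.5,        # 50% of picked clones are true positives
)
print(round(n_clones))
```

A larger assumed insert size lowers the estimate proportionally, so the
result should be read only as an order-of-magnitude consistency check.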

If  overlapping  clones  can  not  be  found  with  the  expected 
level  of  redundancy  in  the  end-sequence  database,  we  will 
screen  the  original  PAC  library  with  probes  or  STS  mark- 
ers derived  from  the  sequenced  PAC  clones. 

Subcontract under Glen Evans' DOE Grant No. DE-FC03-96ER62294.

Multiple-Column  Capillary  Gel 
Electrophoresis 

Norman Dovichi

Department  of  Chemistry;  University  of  Alberta; 
Edmonton,  Alberta,  Canada  T6G  2G2 
403/492-2845, Fax: -8231, norm.dovichi@ualberta.ca
http://hobbes.chem.ualberta.ca

The  objective  of  this  project  is  to  develop  high-throughput 
DNA  sequencing  instrumentation.  A  two-dimensional  ar- 
rayed capillary  electrophoresis  instrument  is  under  devel- 
opment. 

We have developed multiple capillary DNA sequencers.
These instruments have several important attributes. First,
by operation at electric fields greater than 100 V/cm, we
are able to separate DNA sequencing fragments rapidly
and efficiently. Second, the separation is performed with
3%T 0%C polyacrylamide. This low viscosity,
non-crosslinked matrix can be pumped from the capillary
and replaced with fresh material when required. Third, we
operate the capillary at elevated temperature. High
temperature operation eliminates compressions, speeds the
separation, and increases the read length. Fourth, our
fluorescence detection cuvette is manufactured locally by
means of microlithography technology. These detection
cuvettes provide robust and precise alignment of the
optical system. Currently, 5, 16, and 90 capillary instruments
are in operation in our lab; 32 and 576 capillary devices
are under development. Fifth, we use both avalanche
photodiode photodetectors and CCD cameras for high
sensitivity detection. We have obtained detection limits of 120
fluorescein molecules injected onto the capillaries. High
sensitivity is important in detecting the low concentration
fragments generated in long sequencing reads. This
combination of low concentration acrylamide, high temperature
operation, and high sensitivity detection allows separation
of fragments over 800 bases in length in 90 minutes.

DOE Grant No. DE-FG02-91ER61123.


DNA  Sequencing  with  Primer  Libraries 

John J. Dunn, Laura-Li Butler-Loffredo, and F. William
Studier
Biology Department; Brookhaven National Laboratory;
Upton, NY 11973
516/344-3012, Fax: -3407, dunn@genome1.bio.bnl.gov
http://genome5.bio.bnl.gov

Primer  walking  using  oligonucleotides  selected  from  a  li- 
brary is  an  attractive  strategy  for  large-scale  DNA  se- 
quencing. Strings  of  three  adjacent  hexamers  can  prime 
DNA  sequencing  reactions  specifically  and  efficiently 
when  the  template  is  saturated  with  a  single  stranded 
DNA-binding  protein  (1),  and  a  library  of  all  4,096 
hexamers  is  manageable.  We  would  like  to  be  able  to  se- 
quence directly  on  35-kbp  fesmid  templates,  but  the  signal 
from  a  single  round  of  synthesis  is  relatively  weak  and 
triple-hexamer  priming  has  not  yet  been  adapted  for  cycle 
sequencing.  We  reasoned  that  a  hexamer  library  might  be 
used  for  cycle  sequencing  if  combinations  of  hexamers 
could  be  selectively  ligated  by  using  other  hexamers  as  the 
template  for  alignment.  In  this  way,  the  longer  primers 
needed  for  cycle  sequencing  could  be  generated  easily  and 
economically  without  the  need  for  complex  machines  for 
de  novo  synthesis. 

We  found  that  ordered  ligation  of  3  hexamers  to  form  an 
18-mer  occurs  readily  on  a  template  of  the  3  complemen- 
tary hexamers  (offset  by  three  base  pairs)  that  can  base 
pair  unambiguously  to  form  a  double-stranded  complex  of 
indefinite  length  (2).  Each  hexamer  forms  three  comple- 
mentary base  pairs  with  two  other  hexamers,  generating 
complementary  chains  of  contiguous  hexamers  with  strand 
breaks  staggered  by  three  bases.  Two  adjacent  hexamers  in 
the  chain  to  be  ligated  contain  5'  phosphate  groups  and  the 
others  are  unphosphorylated.  Both  T4  and  T7  DNA  ligase 
can  ligate  the  phosphorylated  hexamers  to  their  neighbors 
in such a complex at hexamer concentrations in the 50-100
µM range, producing an 18-mer and leaving three
unphosphorylated hexamers. The products of these ligation
reactions can be used directly for fluorescent cycle
sequencing of 35-kbp templates.
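
The hexamer arrangement described above can be made concrete with a small
sketch. It is a reading of the description, not the authors' software:
the three hexamers to be ligated are taken as consecutive 6-base windows
of the desired 18-mer, and the three template hexamers are the reverse
complements of windows offset by three bases. Because the complex is
described as being of indefinite length, the last template hexamer is
assumed here to wrap from the end of the 18-mer back to its start. The
function names and the example sequence are hypothetical.

```python
# Hexamers needed to assemble an 18-mer by template-directed ligation,
# following the offset-by-three arrangement described in the text.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def hexamer_ligation_design(target):
    """Return (hexamers_to_ligate, template_hexamers) for an 18-base target."""
    if len(target) != 18 or set(target) - set("ACGT"):
        raise ValueError("target must be 18 bases of A, C, G, T")
    to_ligate = [target[i:i + 6] for i in (0, 6, 12)]
    # The chain repeats, so the third template hexamer spans the junction
    # between the end of the target and its start (an assumption).
    cyclic = target + target[:3]
    templates = [reverse_complement(cyclic[i:i + 6]) for i in (3, 9, 15)]
    return to_ligate, templates


LIBRARY_SIZE = 4 ** 6  # all possible hexamers: 4,096

primers, templates = hexamer_ligation_design("ACGTACGGATCCTTGACA")
print(primers)
print(templates)
```

Each template hexamer pairs with the last three bases of one hexamer and
the first three bases of the next, which is what staggers the strand
breaks by three bases. Checking that no alternative perfectly paired
complex exists, as the following paragraph of the abstract requires, is
not attempted here.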

Unambiguous  ligation  requires  that  alternative  complexes 
with  perfect  base  pairing  not  be  possible  with  the  combina- 
tion of  hexamers  used.  Since  the  combination  of  hexamers 
is  dictated  by  the  sequence  of  the  desired  ligation  product, 
some  oligonucleotides  cannot  be  produced  unambiguously 
by  this  method.  However,  82.5%  of  all  possible  18-mers 
could  potentially  be  generated  starting  with  a  library  of  all 
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4096  hexamers,  more  than  adequate  for  high  throughput 
DNA  sequencing  by  primer  walking. 

DOE  Grant  No.  DE-AC02-76CH00016. 
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We  have  developed  a  vector,  referred  to  as  a  fesmid,  for 
making  libraries  of  approximately  35-kbp  DNAs  for  map- 
ping and  sequencing.  The  high  efficiency  lambda  packag- 
ing system  is  used  to  generate  libraries  of  clones.  These 
clones  are  propagated  at  very  low  copy  number  under  con- 
trol of  the  replication  and  partitioning  functions  of  the  F 
factor,  which  helps  to  stabilize  potentially  toxic  clones.  A 
P1 lytic replicon under control of the lac repressor allows
amplification  simply  by  adding  IPTG.  The  cloned  DNA 
fragment  is  flanked  by  packaging  signals  for  bacteriophage 
T7,  and  infection  with  an  appropriate  T7  mutant  packages 
the  cloned  sequence  into  T7  phage  particles,  leaving  most 
of  the  vector  sequence  behind.  The  size  of  the  vector  por- 
tion is  such  that  genomic  fragments  packageable  in  lambda 
(normal  capacity  48.5  kbp)  should  also  be  packaged  in  T7 
(normal  capacity  40  kbp). 

We have made fesmid libraries of several bacterial DNAs,
including Borrelia burgdorferi (the cause of Lyme disease),
Bartonella henselae (the cause of cat scratch fever), E.
coli, B. subtilis, H. influenzae, and S. pneumoniae, some of
which have been reported to be difficult to clone in cosmid
vectors. Human DNA is also readily cloned in these
vectors. Brief amplification followed by infection with a gene
3 and 17.5 double mutant of T7, which is defective in
replicating its own DNA, produces lysates in which
essentially all of the phage particles contain the cloned DNA
fragment. Simple techniques yield high-quality DNA from
these phage particles. Primers for direct sequencing from
the ends of fesmid clones have been made.

Primer walking from the ends of fesmid clones could be an
efficient way to sequence bacterial genomes, YACs, or
other large DNAs without the need for prior mapping of
clones. The ends of fesmids from a random library provide
multiple sites to initiate primer walking. Merging of the
elongating sequences from different clones will
simultaneously generate the sequence of the original DNA and
determine the order of the clones. The packaged fesmid
DNAs are a convenient size for multiple restriction
analyses to confirm the accuracy of the nucleotide sequence.

DOE  Grant  No.  DE-AC02-76CH00016. 
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While  current  plans  call  for  completing  the  human  genome 
sequence  in  2003,  major  obstacles  remain  in  achieving  the 
speed  and  efficiency  necessary  to  complete  the  task  of 
mapping and sequencing. To address this problem, we
proposed a novel approach to large-scale construction
of sequence-ready physical clone maps of the human
genome utilizing end-specific sequence sampling. An earlier
pilot project was initially carried out to develop a GSS
(genomic sequence sampled) map of human chromosome 11
by sequencing the ends of 17,952 chromosome 11-specific
cosmids. This chromosome 11-specific end-sequence
database allows rapid and sensitive detection of clone
overlaps for chromosome 11 sequencing.

In this project, we propose to evaluate the utility of PAC
and BAC end-sequences representing the entire human
genome as a tool for complete, high accuracy mapping and
sequencing. This approach utilizes total genomic
PAC/BAC libraries (constructed by P. de Jong, RPCI),
followed by end-sequencing of both ends of each clone in the
library and limited regional mapping of a subset of clones
as sequencing nucleation points by FISH (fluorescence in
situ hybridization).

To initiate regional analysis, a single clone would be
sequenced by shotgun or primer-directed sequencing, the
entire sequence used to search the end-database for
overlapping clones, and the minimally overlapping clones
selected for extending the sequence. This approach would
allow rational and efficient simultaneous mapping and
sequencing, as well as expediting the coordination and
exchange of information between large and small groups
participating in the human genome project.
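
The search step described above, in which a finished clone sequence is
compared against an end-sequence database to choose the clone with the
least overlap, can be sketched as follows. This is a deliberately
simplified illustration and not the UTSW pipeline: it uses exact
substring matching in place of BLAST, considers only extension past the
right-hand edge of the seed, and treats a clone with both ends inside
the seed as fully contained. All names and sequences are hypothetical.

```python
# Virtual screening of an end-sequence database for an extension clone.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def hit_position(seed, read):
    """Position of an end read in the seed on either strand, or None."""
    for probe in (read, reverse_complement(read)):
        pos = seed.find(probe)
        if pos != -1:
            return pos
    return None


def pick_extension_clone(seed, end_db):
    """Return the clone with one end in the seed and the smallest overlap."""
    candidates = []
    for clone, reads in end_db.items():
        hits = [hit_position(seed, read) for read in reads]
        inside = [h for h in hits if h is not None]
        if len(inside) == 1:  # other end lies outside the seed
            overlap = len(seed) - inside[0]  # assumes rightward extension
            candidates.append((overlap, clone))
    return min(candidates)[1] if candidates else None


seed = "ATGCGTACCTGAAGTCCATGGCTAACGTTAGC"
end_db = {
    "cloneA": ["CCTGAAGT", "TTTTTTTTGG"],
    "cloneB": ["CGTTAGCC", "GGGGGGCCCC"],  # first read matches the reverse strand
    "cloneC": ["GCGTACCT", "CATGGCTA"],    # both ends inside: contained clone
}
print(pick_extension_clone(seed, end_db))
```

In practice repeats, sequencing errors, and read orientation would all
need handling, which is why an alignment tool and the FISH-mapped
nucleation points mentioned in the abstract matter.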
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In  this  pilot  project  proposal  we  are  carrying  out  auto- 
mated end-sequencing  of  approximately  40,000  PAC  and 
BAC  clones  representing  the  entire  human  genome,  as 
well as about 500 PAC clones localized to human
chromosomes 11 and 15. The clones and resulting end-sequence
database will be utilized to 1) nucleate regions of interest
for large-scale sequencing, concentrating on regions of
chromosomes 11 and 15; 2) correspond with regions
mapped by other methods to confirm the mapping
accuracy; and 3) evaluate the use of random clone
end-sequence libraries. DNA sequencing is being carried out in
an  entirely  automated  fashion  using  a  Beckman/Sagian 
robotic  system,  ABI  377  automated  sequencers  and  auto- 
mated sequence  data  processing,  annotation  and  publica- 
tion using  a  Hewlett  Packard/Convex  superparallel  com- 
puter located  at  the  UTSW  genome  center.  FISH  analysis 
of a sample of PAC clones has been carried out and
defines the potential chimera rate in existing PAC libraries as
less than 1.2%. This effort will be coordinated with efforts
of  other  groups  carrying  out  PAC  and  BAC  library  con- 
struction, PAC  and  BAC  end-sequencing  and  FISH  analy- 
sis to  avoid  duplication  of  effort  and  provide  a  comprehen- 
sive end-sequence  library  and  data  set  for  use  by  the  inter- 
national human  genome  sequencing  effort. 

DOE  Grant  No.  DE-FC03-96ER62294. 
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The  development  of  efficient  mapping  approaches  coupled 
with  high  throughput,  automated  DNA  sequencing  remains 
one  of  the  key  challenges  of  the  Human  Genome  Project. 
Over  the  past  few  years,  a  number  of  strategies  to  expedite 
clone-by-clone  DNA  sequencing  have  been  developed  in- 
cluding efficient  shotgun  sequencing,  sequencing  of  nested 
deletions,  and  transposon-mediated  primer  insertion.  We 
have  developed  a  novel  sequencing  strategy  applicable  to 
high  throughput,  large  scale  genomic  analysis  based  upon 
DNA sequencing primed directly on cosmid templates
using  custom-designed,  automatically  synthesized  oligo- 
nucleotide primers.  This  approach  of  directed  primer 
"walking"  would  allow  the  number  of  sequencing  reac- 
tions and  the  efficiency  of  sequencing  to  be  vastly  im- 
proved over  traditional  shotgun  sequencing. 


Custom primer design has been carried out using software
we developed for prediction of "walking" primers directly
from the output of ABI 377 automated DNA sequencers,
and the output used to automatically program synthesis of
the custom primers using 96 or 192 channel
oligonucleotide synthesizers constructed at UTSW. Automated
operation of the sequencing system is thus possible, where
the results of each sequencing reaction are used to predict,
synthesize, and carry out appropriate extension reactions for
downstream "walking". An automated prototype system has
been assembled where dye terminator DNA sequencing
can be carried out from 96 cosmid templates
simultaneously, followed by prediction of oligonucleotide
"walking" primers for extending the sequence of each fragment,
and programming an attached 96-channel oligonucleotide
synthesizer to initiate a second round of sequencing. Using
a set of nested cosmids covering 800 kb at 5X redundancy,
primer directed sequencing should allow completion of
800 kb of finished, high accuracy DNA sequence in 8 to
16 cycles. Furthermore, coupling of automated DNA
sequencing instrumentation to DNA sequence analysis
programs and multichannel oligonucleotide synthesizers will
allow almost complete automation of the sequencing process
and the development of instrumentation for completely
unattended DNA sequencing.
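
The primer prediction step in this cycle, choosing the next "walking"
primer from the end of each new read, can be sketched as below. This is
not the UTSW software, whose selection rules the abstract does not
describe. The criteria used here (a 20-base primer taken from the last
60 bases of the read, GC content between 40% and 60%, no run of more
than four identical bases) are assumptions chosen for illustration.

```python
# Choose a walking primer near the 3' end of a sequencing read.
# All thresholds are illustrative assumptions.

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_walking_primer(read, primer_len=20, window=60,
                        gc_min=0.4, gc_max=0.6, max_run=4):
    """Return (start, primer) for the passing candidate closest to the
    3' end of the read, or None if no candidate passes the filters."""
    start_min = max(0, len(read) - window)
    for start in range(len(read) - primer_len, start_min - 1, -1):
        candidate = read[start:start + primer_len]
        if not gc_min <= gc_fraction(candidate) <= gc_max:
            continue
        if any(base * (max_run + 1) in candidate for base in "ACGT"):
            continue
        return start, candidate
    return None


clean_read = "ACGT" * 15
print(pick_walking_primer(clean_read))
```

Scanning from the 3' end backward keeps each step of the walk as long as
possible, while the filters skip low-complexity stretches that would make
poor primers. A production tool would also check for secondary structure
and for other priming sites on the template.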

DOE  Grant  No.  DE-FG03-95ER62055. 

*Parallel Triplex Formation as Possible
Approach for Suppression of
DNA-Viruses Reproduction

V.L. Florentiev, A.K. Shchyolkina, I.A. Il'icheva, E.N.
Timofeev, and S. Yu Tsybenko
Engelhardt Institute of Molecular Biology; Russian
Academy of Sciences; Moscow 117984, Russia
Fax: +7-095/135-1405, flor@imb.imb.ac.ru

It  is  well  known  that  homopurine  or  homopyrimidine 
single  stranded  oligonucleotides  can  bind  to 
homopurine-homopyrimidine  sequences  of  two-stranded 
DNA  to  form  stable  three-stranded  helices.  In  such  tri- 
plexes two  identical  strands  have  antiparallel  orientation. 
We  denote  these  triplexes  as  "antiparallel"  or  "classical" 
triplexes. 

Particular interest in triplexes has arisen
from the elegant idea of using them as
sequence-specific tools for purposeful influence on DNA
duplexes. Triplex-forming oligonucleotides were shown to
be potentially useful as regulators of gene expression and
subsequently as therapeutic (antiviral) agents.

A significant limitation to the practical application of
antiparallel triplexes is the requirement for homopurine tracts in
target DNA sequences. Numerous investigations slightly
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expanded  the  repertoire  of  triple-forming  sequences  but 
did  not  completely  remove  this  limitation. 

It was recently shown that a triple-stranded DNA intermediate
is formed during homologous recombination promoted by RecA.
Such a structure is a new form of the triple helix. In sharp
contrast with the "classical" triplexes, its third strand is
parallel to the identical strand of the Watson-Crick duplex.
We denote this structure as the "parallel" triplex. Until
recently, the parallel triplex was obtained only by
deproteinization of joint molecules generated by
recombination proteins.

We first obtained experimental results (chemical probing,
melting curves, and fluorescent dye binding) that provide
convincing evidence for protein-independent formation
of the parallel triplex [1] and then confirmed this fact by FTIR
data [2]. Because the parallel triplex can be formed for any
sequence, it might be an "ideal" tool for sequence-specific
recognition of DNA. Unfortunately, the low stability
of parallel triplexes prohibits practical application of these
structures.

Earlier we found that propidium iodide selectively stabilizes
parallel triplexes [3]. This finding is the basis of a new
approach to the stabilization of parallel triplexes that we
are now developing. The approach uses a targeting
oligonucleotide that contains, in an internucleotide linkage,
an alkyl insert coupled through a linker to an intercalating
ligand. The length of the linker was chosen by molecular
dynamics calculations to allow the ligand to intercalate in
the same stacking contact.

A preliminary study showed that the presence of intercalating
inserts considerably increases the stability of DNA duplexes
[4]. We are now investigating in detail the effect of such
modification of targeting oligonucleotides on the stability of
parallel triplexes.

DOE Grant No. OR00033-93CIS005.

Automation of a large-scale sequencing process based on
instrumentation for automated DNA hybridization and
detection is a focal point of our research. Recently, we have
devised a method for amplifying fluorescent light output
on nylon membranes by using an alkaline phosphatase-
conjugated probe system combined with a fluorogenic
alkaline phosphatase substrate [1]. The amplified signal
allows sensitive detection of DNA hybrids in the
sub-femtomole/band range.

On the basis of this detection chemistry, automated devices
for detecting DNA on blotted microporous membranes using
enzyme-linked fluorescence, termed Probe Chambers,
have been built. The fluorescent signal is collected by a
CCD camera operating in a Time Delay and Integration
mode. Concentrated solutions of probes and enzymes are
stored in Peltier-cooled, septum-sealed vials and delivered by
syringe pumps residing in a gantry-style pipetting robot.
Fluorescence excitation is generated by a mercury arc
lamp acting through a fiber optic "light line". Three 30 x
63 centimeter sequencing membranes can be processed
simultaneously, currently revealing up to 108 lane sets
per multiplex cycle. A probing cycle is completed
approximately every eight hours.

Integration of the Probe Chamber into the production
pipeline is accomplished through connections to the laboratory
database. A critical component of a high-throughput
sequencing laboratory is the software for interfacing to
instrumentation and managing work flow. The Informatics
Group of the Utah Genome Center has designed and
implemented an innovative system for automating and
managing laboratory processes. This software allows the
model of workflow to be easily defined. Given such a
model, the system allows the user to direct and track the
flow of laboratory information. The core of the system is a
generic, client-server process management engine that
allows users to define new processes without the need for
custom programming. Based on these definitions, the
software will then route information to the next process, track
the progress of each task, perform any automated
operations, and provide reports on these processes. To further
increase the usefulness of our laboratory information
system, we have augmented it with hand-held mobile
computing devices (Apple Newtons) that link to the database
through RF networking cards.
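
As an illustration only, the idea of defining a workflow as data and letting a generic engine route each task to the next process might be sketched as follows. The abstract does not describe the actual software design; the process names, the `Task` record, and the `advance` function are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass, field

# Hypothetical workflow definition: each process names its successor.
# New processes are added by editing this table, not by writing code.
WORKFLOW = {
    "template_prep": "sequencing_reaction",
    "sequencing_reaction": "membrane_probing",
    "membrane_probing": "base_calling",
    "base_calling": None,  # None marks the end of the pipeline
}


@dataclass
class Task:
    sample_id: str
    step: str
    history: list = field(default_factory=list)  # completed steps, in order


def advance(task, workflow=WORKFLOW):
    """Record the finished step and route the task to the next process."""
    task.history.append(task.step)
    task.step = workflow[task.step]
    return task


def report(tasks):
    """Count how many tasks are waiting at each step (None = finished)."""
    counts = {}
    for t in tasks:
        counts[t.step] = counts.get(t.step, 0) + 1
    return counts
```

Keeping the routing table separate from the engine is what would let a user define a new process without custom programming, which is the property the abstract emphasizes.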

Base-calling software has been developed to support our
automated, large-scale sequencing effort. First-stage
sequence calling identifies putative bands; however,
depending on the number of reader indel errors (2-6%), merging
first-stage sequence without the aid of cutoff information
can be difficult. To improve our base calling, we have
employed fuzzy logic to establish confidence metrics. The
logic produces a confidence metric for each band using
band height, width, uniqueness, shape, and the gaps to
adjacent bands. The confidence metric is then used to
identify the largest block of highest-quality sequence to be
merged.
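
The final step, picking the largest block of high-confidence calls, can be sketched in a few lines. This is a minimal illustration, not the authors' method: the abstract does not say how the fuzzy-logic metric is computed or how the block is chosen, so the per-band confidences are taken as given and the `threshold` value is an assumed parameter.

```python
def best_block(confidences, threshold=0.8):
    """Return (start, end) of the longest run of band calls whose
    confidence is at or above `threshold`; `end` is exclusive.
    Returns (0, 0) when no call qualifies. The threshold is hypothetical."""
    best = (0, 0)
    start = None  # start index of the run currently being extended
    for i, c in enumerate(confidences):
        if c >= threshold:
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None  # a low-confidence call ends the current run
    return best
```

For example, with confidences `[0.9, 0.5, 0.9, 0.95, 0.85, 0.4, 0.9]` the function selects calls 2 through 4, and only that slice of the read would be passed on for merging.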

DOE Grant No. DE-FG03-94ER61817.

The purpose of the Resource for Molecular Cytogenetics is
to  develop  molecular  cytogenetic  techniques,  instruments 
and  reagents  needed  to  facilitate  large  scale  genomic  DNA 
sequencing  and  to  assist  in  identification  and  functional 
characterization  of  genes  involved  in  disease  susceptibility, 
genesis  and  progression.  This  work  is  closely  coordinated 
with  the  LBNL  Human  Genome  Program  and  directly  sup- 
ports research  in  the  LBNL  Life  Sciences  Division  and  the 
UCSF Cancer Center. Work currently is in four areas:
a) genome analysis technology, b) probe development and
physical map assembly, c) digital imaging microscopy, and
d) informatics. The Resource acts as a catalyst for research
in several areas, so some support comes from industry, the
NIH, and NIST.

Probe  development  and  physical  map  assembly:  The  Re- 
source maintains  a  list  of  over  a  thousand  publicly  available 
probes  suitable  for  molecular  cytogenetic  studies.  These  in- 
clude approximately  600  probes  each  selected  by  the  Re- 
source to  contain  a  known  STS  or  EST.  Probes  selected  by 
the  Resource  can  be  requested  through  our  web  page. 


The  Resource  also  participates  in  the  development  of  low 
and  high  resolution  physical  maps  to  facilitate  analysis  and 
characterization  of  genetic  abnormalities  associated  with 
human  disease.  Low  resolution  mapping  panels  with 
probes  distributed  at  few  megabase  intervals  have  been 
completed  this  year  for  chromosomes  1,  2,  3,  7,  8,  10,  and 
20.  The  mapped  STSs  associated  with  these  probes  facili- 
tate movement  from  low  to  high  resolution  physical  maps. 
STS content mapping and DNA fingerprinting have been
applied to develop a high-resolution, sequence-ready map
composed of BAC and P1 clones for the ~1 Mb region of
chromosome 20 between WI9227 and D20S902. This
region is amplified in ~10% of human breast cancers.
Approximately 300 kb of this region has been sequenced by
the LBNL Human Genome Program.

Quantitative  DNA  fiber  mapping  (QDFM)  has  been  devel- 
oped this  year  to  facilitate  high  resolution  analysis  of  ge- 
nomic overlap  between  cloned  probes.  In  this  approach, 
cloned  DNA  molecules  are  uniformly  stretched  during  dry- 
ing by  the  hydrodynamic  action  of  a  receding  meniscus. 
The  position  of  specific  sequences  along  the  stretched 
DNA  molecules  is  visualized  by  fluorescence  in  situ  hy- 
bridization (FISH)  and  measured  by  digital  image  analysis. 
QDFM has been used to map gamma alpha transposons,
plasmid or cosmid probes along P1 molecules, and P1 or
PAC clones along straightened YAC molecules with
few-kilobase resolution. QDFM is now being studied to
determine its utility in the assembly of minimally overlapping,
sequence-ready contigs, assessment of the integrity of
cloned BACs, and mapping of subclones prepared for
directed DNA sequencing along the clone from which they
were derived.

Genome  analysis  technology:  The  Resource  has  partici- 
pated in  the  development  of  comparative  genomic  hybrid- 
ization (CGH)  as  a  tool  for  detection  and  mapping  of 
changes  in  relative  DNA  sequence  copy  number  in  humans 
and mouse. This year, CGH to arrays of cloned probes
(CGHa) has been demonstrated. This is advantageous
because it allows aberrations to be mapped with resolution
determined by the genomic spacing of probes on the array.
CGHa also is attractive since it appears to be linear over a
relative copy number range of at least 10^4 between the two
nucleic acid samples being compared.

The  Resource  has  participated  in  the  development  of  FISH 
approaches  to  analysis  of  relative  gene  expression  in  nor- 
mal and  aberrant  tissues.  FISH  with  cloned  or  predicted 
expressed  sequences,  previously  developed  in  C.  elegans, 
is  now  being  applied  to  the  assessment  of  expression  of 
human  genes.  The  C.  elegans  work  suggests  a  throughput 
of  several  dozen  sequences  per  month.  Information  from 
this  approach  will  be  important  in  assessment  of  the  func- 
tion of  newly  discovered  genes,  including  those  predicted 
from  DNA  sequencing. 

Digital imaging microscopy: The Resource supports work
in  microscopy,  image  processing  and  analysis  methods 
needed  for  CGH  and  CGHa,  3D  FISH,  tissue  analysis,  rare 
event  detection,  multi-color  image  acquisition,  aberration 
scoring  for  biodosimetry,  and  analysis  of  FISH  to  DNA 
fibers.  Developments  this  year  include  an  improved  pack- 
age for  CGH  and  prototype  systems  for  analysis  of  DNA 
fibers,  CGHa  arrays  and  semiautomatic  segmentation  of 
nuclei  in  three  dimensions. 

Informatics: The Resource maintains a web site at
http://rmc-www.lbl.gov that summarizes information about
mapped probes. Probes developed by the Resource can be
requested directly through this page. In addition, the
Resource has developed a Web page for exchange of
genomic, genetic, and biologic information between
geographically dispersed collaborators. The page, under
password control, carries information about physical maps,
genomic sequence, sequence annotation, and gene
expression images.

DOE Contract No. DE-AC03-76SF00098.

Automation

Trevor  Hawkins 

Center for Genome Research; Whitehead Institute/Massachusetts
Institute of Technology; Cambridge, MA 02139
617/252-1910, Fax: -1902, tlh@genome.wi.mit.edu
http://www-genome.wi.mit.edu

The objective of this project is to develop a high-throughput,
fully automated robotic device for the complete
automation of the sequencing process. We also aim to further
develop DNA sequencing electrophoresis systems and to
integrate these devices with our robotics.

We have built the Sequatron, an integrated, robotic device
which automates the tasks of DNA purification and setup
of thermal cycle sequencing reactions. The major
component of our system is an articulated, track-mounted
CRS 255A robotic arm. The deck of the robot contains
several new or modified XYZ robotic workstations, a
novel thermal cycler with automated heated lids,
carousels, and custom-built plate feeders.

Biochemically,  we  have  employed  our  Solid-phase  revers- 
ible immobilization  (SPRI)  technique  to  isolate  and  ma- 
nipulate the  DNA  throughout  the  process. 

Specifically, we have set up the Sequatron to isolate DNA
from M13 phage or crude PCR products using the same
protocol and procedures. From M13 phage we obtain
approximately 1 µg of DNA per well, which is sufficient for
multiple sequencing reactions.


The current throughput of the system is 80 microtiter plates
of samples, from M13 phage supernatants or crude PCR
products to sequence-ready samples, every 24 hours.
Recently, new enzymes, new energy transfer primers, and
higher-density microtiter plates have opened up possible
increases to in excess of 25,000 samples per 24-hour period.

DOE Grant No. DE-FG02-95ER62099.

Genome Sequencing

Leroy Hood, Mark D. Adams,¹ and Melvin Simon²

University of Washington; Seattle, WA 98195-7730
206/616-5014, Fax: /685-7301, tawny@u.washington.edu
¹The Institute for Genomic Research; Rockville, MD
20850; mdadams@tigr.org
²California Institute of Technology; Pasadena, CA 91125;
simonm@starbasel.caltech.edu

Bacterial  artificial  chromosomes  (BACs)  represent  the 
state  of  the  art  cloning  system  for  human  DNA  because  of 
their stability and ease of manipulation. Venter, Smith and
Hood (Nature 381:364-366, 1996) have proposed a
strategy based on the use of sequences from the ends of all
clones in a deep-coverage BAC library to produce a
sequence-ready set of clones for the human genome. We
propose to demonstrate the effectiveness of this strategy by
performing a directed test, initially on chromosomes 16
and 22, and continuing on to chromosome 1. All available
markers  on  chromosome  16  (including  the  large  number  of 
soon-to-be-available  radiation  hybrid  markers)  will  be 
used  to  screen  the  existing  8x  BAC  library  at  CalTech. 
This  will  serve  to  evaluate  the  quality  of  the  library  in 
terms  of  representation  of  broad  chromosomal  regions.  A 
similar  procedure  will  be  used  for  chromosome  22,  except 
that  the  existing  BAC  map  will  be  used  to  select  more 
evenly  spaced  markers  for  screening,  including  use  of 
end-sequence  markers  from  the  current  chromosome  22 
BAC  map  constructed  in  the  Simon  lab.  Each  identified 
clone  will  be  rearrayed  from  the  library  and  end  se- 
quenced. This  information  will  dovetail  nicely  with  ongo- 
ing sequencing  projects  at  TIGR  and  the  Sanger  Centre, 
which  will  in  turn  provide  additional  information  on  the 
average  degree  of  BAC  overlap  detectable  by  this  method, 
the  degree  of  interference  with  genome-wide  repeats,  and 
the  appropriate  use  of  fingerprinting  as  an  early  or  late  ad- 
dition to  the  end-sequencing  information.  In  addition,  we 
will  develop  and  implement  cost-effective, 
high-throughput  methods  of  preparing  and  end-sequencing 
BAC DNA that are suitable for scaling to characterization
of the full 400,000 clones necessary for characterization of
a 15x human BAC library.
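
The scale quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a haploid human genome of about 3 x 10^9 bp, a figure the abstract does not state; the function names are illustrative only.

```python
GENOME_BP = 3_000_000_000  # assumed haploid human genome size, not from the abstract


def mean_insert_bp(coverage, n_clones, genome_bp=GENOME_BP):
    """Average insert size implied by a library of n_clones at the given fold coverage."""
    return coverage * genome_bp / n_clones


def end_reads(n_clones):
    """Sequencing both ends of every clone needs two reads per clone."""
    return 2 * n_clones


# 15x coverage in 400,000 clones implies inserts of roughly 112.5 kb,
# and end-sequencing the whole library means 800,000 reads.
print(mean_insert_bp(15, 400_000), end_reads(400_000))
```

Under that assumption the implied average insert size is on the order of 100 kb, which is consistent with the abstract's description of BACs as a large-insert cloning system.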

DOE Grant No. DE-FC03-96ER62299.

DNA  Sequencing  Using  Capillary 
Electrophoresis 

Barry  L.  Karger 

Barnett Institute; Northeastern University; Boston, MA
02115
617/373-2867 or -2868, Fax: -2855
bakarger@lynx.neu.edu

During  the  past  year,  we  have  made  major  progress  in  the 
design  of  a  replaceable  polymer  matrix  for  DNA  sequenc- 
ing and  the  development  of  the  first  generation  multiple 
capillary array of 12 capillaries. We also implemented
ultrafast separation of dsDNA (e.g. 30 sec for complete
resolution of the standard ΦX174-HaeIII restriction
fragments).

In  the  separation  of  sequencing  reaction  products,  we  com- 
pleted a  study  on  the  role  of  polymer  molecular  weight  and 
concentration.  Using  linear  polyacrylamide  (LPA),  the 
polymer  with  which  we  have  had  our  most  success,  we 
have  achieved  1000  base  read  lengths  in  1  1/2  hrs.  Optimi- 
zation of  column  length,  electric  field  and  column  tem- 
perature (50°  C)  was  required.  Using  emulsion  polymer- 
ization, we  are  now  able  to  produce  LPA  powders  with 
MW  of  -  lO-"  k  Da.  The  fully  replaceable  matrix  is  very 
powerful  for  rapid  sequencing  of  long  reads. 

We  have  successfully  implemented  a  12-capillary  array 
instrument  and  are  using  it  to  study  issues  of  ruggedness  in 
routine  sequencing.  As  part  of  this,  we  have  developed  a 
sample  clean-up  procedure  which  reduces  all  reactions  to  a 
similar  state  in  terms  of  sample  solution  prior  to  injection. 
The results of this work have led to the design of a
96-capillary array that we will implement over the next year.

We  have  also  achieved  very  fast  separations  of  ss-  and 
dsDNA using short capillaries and very high fields. For
example,  sequencing  300  bases  in  3^  mins.  has  been 
shown,  as  well  as  very  rapid  mutational  analysis.  Imple- 
mentation of  such  speeds  on  a  capillary  array  will  create 
an  instrument  for  high  throughput  automated  analysis. 

DOE Grant No. DE-FG02-90ER60985.

Ultrasensitive Fluorescence Detection
of  DNA 

Richard  A.  Mathies  and  Alexander  N.  Glazer 

Departments  of  Chemistry  and  Molecular  and  Cell 
Biology;  University  of  California;  Berkeley  CA  94720 
510/642-4192,  Fax:  -3599,  rich@zinc.cchem.berkeley.edu 

The  overall  goal  of  this  project  is  to  develop  new  fluores- 
cence labeling  methods,  separation  methods  and  detection 
technologies  for  DNA  sequencing  and  genomic  analysis. 

Highlights  along  with  representative  publications  are  given 
below. 

Energy Transfer Primers. Families of sequencing and PCR
primers have been developed that contain both
fluorescence donor and acceptor chromophores. These labeled
primers with optimized excitation and emission properties
provide from 2- to 20-fold enhanced signal intensities in
automated DNA sequencing with slab gels and with
capillary arrays. The reduced spectral cross talk of these ET
primers also makes them valuable in PCR product and
STR analyses.

New Intercalation Dye Labels. A new family of
heterodimeric bis-intercalation dyes has been synthesized
exploiting the concept of fluorescence energy transfer
between two different cyanine intercalators. By tailoring the
spectroscopic properties of the dyes, labels with intense
emission above 650 nm following 488 nm excitation have
been fabricated. By adjusting the spacing linker between
the two dyes, the binding affinity has also been optimized.
These molecules are useful for noncovalent multiplex
labeling of ds-DNA in a wide variety of multicolor
analyses.

Capillary Electrophoresis Chips. Capillary and capillary
array electrophoresis systems have been
photolithographically fabricated on 2 x 3 inch glass substrates.
These devices provide high-quality electrophoretic separations of
ds-DNA fragments and DNA sequencing reactions with a
10-fold increase in speed. Arrays of up to 32 capillaries on
a single chip have been fabricated.

Single DNA Molecule Fluorescence Burst Detection. A
confocal fluorescence system has been used to
demonstrate that single molecule fluorescence burst counting can
be used to detect CE separations of ds-DNA fragments.
Fragments as small as 50 bp can be counted, and mass
sensitivities as low as 100 molecules per electrophoresis band
are possible. This technology should be valuable in
incipient cancer and trace pathogen detection.


DOE Grant No. DE-FG03-91ER61125.

In 1996, more than thirty U.S. and Russian research
workers participated in the joint Human Genome Program
between Argonne National Laboratory and Engelhardt
Institute of Molecular Biology on the development of
sequencing by hybridization with oligonucleotide microchips
(SHOM).

During this year, about twenty Russian scientists have
been working from 3 months to 1 year at ANL. In this
period, 3 papers have been published, 5 papers accepted
for publication, and 3 more papers submitted for
publication.

The  main  research  efforts  of  the  group  have  been  concen- 
trated in  three  directions: 

I.  Improvement  of  SHOM  technology. 

II.  Development  of  SHOM  for  the  needs  of  Human  Ge- 
nome Program. 


III.  Development  of  new  approaches  based  on  SHOM 
technology. 

I.  Improvement  of  SHOM  technology 

As  a  major  result  of  the  work  in  this  direction,  simple,  reli- 
able and  effective  methods  of  microchip  manufacturing, 
sample  preparations,  and  quantitative  hybridization  analy- 
sis by  fluorescence  microscopy  have  been  developed  or 
improved. 

1. The photopolymerization technique for production of
micromatrices of polyacrylamide gel pads on a
hydrophobicized glass surface was improved to become a
simple, highly reproducible, and inexpensive procedure (7).

2.  New  and  cheaper  chemistry  of  the  oligonucleotide  im- 
mobilization has  been  developed  and  introduced  for  pro- 
duction of  more  durable  microchips.  It  is  based  on  the  use 
of  amino-oligonucleotides  and  aldehyde-gels  instead  of 
3-methyluridine-oligonucleotides  and  hydrazide-gels  (3). 

3. A four-pin robot has been constructed with computer
control of every microchip element production. High-quality
microchips with 4100 immobilized oligonucleotides have
been manufactured, and the complexity of the microchips
can easily be scaled up to a few tens of thousands of elements.

4. A two-color fluorescence microscope has been equipped
for regular use with proper mechanics and software. It
allows investigators to routinely use automatic quantitative
monitoring of hybridization on the whole microchip
and to measure the kinetics of hybridization as well as
the melting curves of duplexes formed with all microchip
oligonucleotides (1,2,8).

5. A four-color fluorescence microscope was manufactured,
and four suitable fluorescent dyes are at present being
selected.

6.  Chemical  methods  of  introduction  of  several  fluores- 
cence dyes  into  DNA  and  RNA  with  or  without  fragmenta- 
tion have  been  developed  and  regularly  used  in  SHOM 
experiments  (4). 

7. A theory describing the kinetics of hybridization with
gel-immobilized oligonucleotides has been developed (5).

8.  Simple  and  relatively  inexpensive  equipment  (around 
$10,000  per  set)  has  been  produced  for  manual  manufac- 
turing of  microchips  and  fluorescence  measurement  of  hy- 
bridization, which  will  enable  every  laboratory  to  produce 
and  practically  use  microchips  containing  up  to  100  immo- 
bilized oligonucleotides  or  other  compounds. 

II. Application of SHOM

Although the main goal of our SHOM development is to
produce a simple de novo sequencing procedure, a number
of other SHOM applications have been tested as
intermediate steps in the SHOM research.

1. Sequence analysis and sequencing

A  number  of  technical  problems  should  be  solved  for  de 
novo  sequencing  although  they  are  much  less  stringent  for 
comparative  sequence  analysis  than  for  de  novo  sequenc- 
ing. Among  these: 

a) Reliable discrimination of perfect and mismatched
duplexes. We have significantly improved the discrimination
by decreasing the length of hybridized oligonucleotides to
6- and 8-mers (1,7) and by using 5-mers in "contiguous
stacking" hybridization (1,2). Essential improvement was
also achieved by automatically measuring the melting
curves for duplexes formed in each microchip element,
calculating their thermodynamic parameters (free energy,
enthalpy, and entropy) for different regions of the melting
curves, and comparing them with these parameters for
perfect duplexes. In addition, highly reliable discrimination
was achieved by using two-color fluorescence microscopy
and by quantitative comparison of the hybridization
patterns of a known DNA or synthetic oligonucleotides and
the DNA under study, labeled with different fluorophores (8).

b) Differences in hybridization efficiency that depend on the
GC-content and the length of the duplex. We have
equalized the efficiency by choosing the proper concentration for
each immobilized oligonucleotide (6,7) and also by
increasing the effective length of immobilized oligonucleotides
by adding, at one or both of their ends, 5-nitroindole as a
universal base or a mixture of four bases (2).

c) Interference of hairpins and other structures in DNA
with the less stable duplexes formed upon hybridization of
the DNA with the comparatively short immobilized
oligonucleotides of the microchip. This interference was decreased
by fragmentation of the analysed DNA and RNA samples
in the course of incorporation of a fluorescent label (4).
We have also tested incorporation, by a chemical bond, of
an intercalator into immobilized oligonucleotides, which
stabilized their base pairing with DNA over hairpin formation
(10).

d) The need to increase microchip complexity for
sequencing long DNA stretches. As an alternative, further
development of so-called contiguous stacking
hybridization was shown to improve the efficiency of an 8-mer
microchip up to that of a 13-mer microchip, so that DNA of
several kilobases in length could be sequenced by SHOM (2).

e) 6-mer microchips for sequencing and sequence analysis.
We have now come to the stage of manufacturing
microchips containing 4,096 (i.e. all possible) 6-mers. The
control tests partly described above have shown that these
microchips can be effectively used for sequence analysis,
mutation diagnostics, and detection of sequencing mistakes
made by conventional gel-sequencing methods. We hope that
after demonstrating the efficiency of 6-mer microchips, we
shall be able to get sufficient financial support for
production of the microchip with all 65,536 8-mers.
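
The probe counts quoted in item e) follow from the four-letter DNA alphabet: there are 4^6 = 4,096 possible 6-mers and 4^8 = 65,536 possible 8-mers. As a minimal sketch (the function names are illustrative, and strand complementarity and hybridization chemistry are deliberately ignored), the complete probe set and the set of k-mers contained in a target sequence can be enumerated as follows.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every possible oligonucleotide of length k over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def spectrum(seq, k):
    """Set of all length-k substrings of seq. In an idealized experiment
    these correspond to the chip elements expected to give a signal
    (complementarity is ignored here for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


print(len(all_kmers(6)))   # number of elements on a complete 6-mer chip
print(4 ** 8)              # number of elements on a complete 8-mer chip
```

Moving from 6-mers to 8-mers multiplies the number of chip elements by 16, which is one way to read the authors' remark that the 8-mer chip depends on further funding.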

2.  Mutation  diagnostics  and  gene  polymorphism  analysis 

The improvements described above have been introduced
for reliable ("Yes" or "No" mode) identification of
single-base changes in human genomic DNA. The
efficiency of SHOM has been demonstrated for identification
of a number of β-thalassemia mutations (1,2,8) and HLA
allele variations in the human genome.

3.  Identification  of  microorganisms  and  gene  expression 
monitoring 

Bacterial  microchips  have  been  manufactured  and  tested. 
Their  ability  for  reliable  identification  of  a  number  of  bac- 
terial strains  in  the  sample  has  been  demonstrated  (6).  The 
chips  containing  oligonucleotides  complementary  to  spe- 
cific regions  of  16S  ribosomal  RNA  were  hybridized  with 
samples  of  rRNA,  total  RNA,  DNA  and  RNA  transcripts 
of  PCR-amplified  genomic  rDNA.  Similar  preliminary 
experiments  demonstrated  the  efficiency  of  SHOM  for 
monitoring  the  gene  expression. 

III.  Development  of  new  approaches  based  on  the 
SHOM  technology 

1.  Enzymatic  modification  of  nucleic  acids  on  selected  ele- 
ments of  the  oligonucleotide  chip.  The  gel  pads  of  the  oli- 
gonucleotide chip  are  separated  from  each  other  by  hydro- 
phobic glass  surface.  It  prevents  the  cross-talking  of  the 
chip  elements  when  a  drop  of  solution  is  applied  on  speci- 
fied elements.  At  the  same  time,  a  high  porosity  of  the  gel 
allows  diffusion  of  large  proteins  into  the  gel.  We  have 
demonstrated  that  immobilized  oligonucleotides  can  be 
enzymatically  phosphorylated  and  ligated  with  contigu- 
ously stacked  5-mer  after  hybridization  with  DNA.  A 
walking  sequencing  procedure  by  stacked  pentanucleotides 
was  proposed  that  is  based  on  enzymatic  ligation  and 
phosphorylation  on  oligonucleotides  chips  (9). 

2.  DNA  fractionation  on  oligonucleotide  chips.  Due  to  the 
same  properties,  the  oligonucleotide  chips  are  used  for 
fractionation  of  DNA  after  DNA  hybridization  with  some 
complementary  oligonucleotides  of  the  chip.  A  new  proce- 
dure for  sequencing  long  DNA  pieces  was  proposed  that  is 
based  on  fractionation  of  DNA  on  fractionating  oligo- 
nucleotide chips  followed  by  sequencing  of  the  isolated 
DNA  by  SHOM  on  sequencing  microchips.  The  procedure 
allows  the  investigator  to  skip  cloning  and  mapping  of 
long  DNA  pieces  (9). 

Conclusions 

It appears that the major technical problems of SHOM
have been for the most part solved, and this technology can
already be applied for sequence analysis and checking the
accuracy of conventional sequencing methods. A number
of other applications in the Human Genome Program are
within the reach of SHOM, such as mutation screening,
gene polymorphism studies, detection of microorganisms,
gene expression studies, etc. Application of SHOM for de
novo DNA sequencing requires manufacturing of more
complicated microchips and improvement of some other,
already available methods.

DOE Contract No. W-31-109-Eng-38.

High-Throughput DNA Sequencing: SAmple
SEquencing (SASE) Analysis as a Framework
for Identifying Genes and Complete
Large-Scale Genomic Sequencing

Robert K. Moyzis and Jeffrey K. Griffith

The human chromosome 5 and 16 physical maps (Doggett
et al., Nature 377:Suppl:335-365, 1995; Grady et al.,
Genomics 32:91-96, 1996) provide the ideal framework
for initiating large-scale DNA sequencing. These physical
mapping studies have shown clearly that gene density in
humans will vary greatly. For example, band 16q21,
consisting of 8 Mb of DNA, has no genes or trapped exons
assigned to it as yet. In contrast, band 16p13.3 has an
extremely high density of coding regions in the DNA
examined to date (i.e., multiple genes/cosmid). Given this wide
variation  in  gene  density  and  current  sequencing  costs,  we 
propose  that  newly  targeted  genomic  regions  should  be 
analyzed  first  by  a  "Lewis  and  Clark"  exploratory  ap- 
proach, before  committing  to  full  length  DNA  sequencing. 
We  are  using  a  SAmple  SEquencing  (SASE)  approach  to 
rapidly  generate  aligned  sequences  along  the  chromosome 
5  and  16  physical  maps.  SASE  analysis  is  a  method  for 
rapidly  "scanning"  large  genomic  regions  with  minimal 
cost,  identifying,  and  localizing  most  genes.  Briefly,  indi- 
vidual cosmids  are  partially  digested  with  Sau3A  and  3  kb 
fragments  are  recloned  into  double-strand  sequencing  vec- 
tors. By  sequencing  both  ends  of  a  IX  sampling  of  these 
recloned  fragments  along  with  end  sequences  of  the 
cosmid,  70%  sequence  coverage  is  achieved  with  98% 
clone  coverage.  The  majority  of  this  clone  coverage  is  or- 
dered by  the  relationship  between  the  subclone  end  se- 
quences. These  ordered  sequences  are  ideal  substrates  for 
directed  sequencing  strategies  (for  example,  primer  walk- 
ing or  transposon  sequencing).  SASE  analysis  has  been 
initiated  on  the  40  Mb  short  arm  of  chromosome  1 6  and 
the  45  Mb  short  arm  of  chromosome  5.  We  propose  to 
make  SASE  sequences,  along  with  feature  annotation, 
publicly  available  through  GSDB.  Such  data  are  sufficient 
to  allow  PCR  amplification  of  the  sequenced  region  from 
GSDB  submissions  alone,  eliminating  the  need  for  exten- 
sive clone  archiving  and  distributing,  will  allow  for  the 
effective  "democratization"  of  the  genome,  allowing  nu- 
merous laboratories  to  share  and  contribute  to  the  growing 
genome  databases. 
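
The coverage figures quoted in this abstract are close to what a simple random-sampling (Poisson, Lander-Waterman style) model predicts. The sketch below is illustrative only: it assumes that "1X sampling" means 1X sequence redundancy, and the 400-base read length is a hypothetical value that the abstract does not give.

```python
import math

def covered_fraction(redundancy: float) -> float:
    """Expected fraction of a target covered at a given redundancy,
    assuming randomly placed reads or clones (Poisson model)."""
    return 1.0 - math.exp(-redundancy)

INSERT_BP = 3000        # subclone insert size stated in the abstract
READ_BP = 400           # assumed read length (hypothetical, not from the abstract)
SEQ_REDUNDANCY = 1.0    # "1X sampling", read here as 1X sequence redundancy

# Each subclone contributes two end reads, so clone redundancy exceeds
# sequence redundancy by insert size / (2 * read length).
clone_redundancy = SEQ_REDUNDANCY * INSERT_BP / (2 * READ_BP)

print(f"sequence coverage ~ {covered_fraction(SEQ_REDUNDANCY):.0%}")
print(f"clone coverage    ~ {covered_fraction(clone_redundancy):.0%}")
```

Under these assumptions the model gives about 63% sequence coverage and about 98% clone coverage. The reported 70% sequence coverage is somewhat higher, which may reflect the added cosmid end sequences or a different read length.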

DOE Grant No. DE-FG03-96ER62298.

One-Step PCR Sequencing

Kenneth W. Porter, J. David Briley, and Barbara Ramsay
Shaw

A method is described to simultaneously amplify and sequence
DNA using a new class of nucleotides containing
boron. During the polymerase chain reaction,
boron-modified nucleotides, i.e., 2'-deoxynucleoside
5'-α-[P-borano]-triphosphates, are incorporated into the
product DNA. The boranophosphate linkages are resistant
to nucleases, and thus the positions of the boranophosphates
can be revealed by exonuclease digestion,
thereby generating a set of fragments that defines the DNA
sequence. The boranophosphate method offers an alternative
to current PCR sequencing methods.
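
How nuclease-resistant positions can define a sequence is easiest to see in a toy simulation. The sketch below rests on simplifying assumptions that are not stated in the abstract: one reaction per base, digestion from the 3' end that stops at the first boranophosphate it meets, and fragment lengths that are read exactly. The function names are hypothetical.

```python
def fragment_lengths(product: str, base: str) -> list[int]:
    """Lengths of digestion-resistant fragments for one base-specific reaction.

    Digestion from the 3' end is assumed to stop at a boranophosphate, so an
    occurrence of `base` at position i (1-based) yields a fragment of length i.
    """
    return [i for i, b in enumerate(product, start=1) if b == base]

def read_sequence(ladders: dict[str, list[int]]) -> str:
    """Rebuild the sequence by ordering all fragments by length."""
    by_length = {n: base for base, lengths in ladders.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

product = "GATTACA"  # invented PCR product, written 5'->3'
ladders = {base: fragment_lengths(product, base) for base in "ACGT"}
print(ladders)
print(read_sequence(ladders))
```

In this idealized picture the four ladders together contain one fragment per position, which is the same information a dideoxy ladder carries, but obtained from the PCR product itself.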

Single-sided primer extension with dideoxynucleotide
chain terminators is avoided, with the consequence that the
sequencing fragments are derived directly from the original
PCR products. Boranophosphate sequencing is demonstrated
with the Pharmacia and the Applied Biosystems
373A automatic sequencers, producing data that are comparable
to cycle sequencing.

DOE Grant No. DE-FG02-97ER62376 and NIH Grant No.
HG00782.

Automation of the Front End of DNA
Sequencing

Lloyd M. Smith and Richard A. Guilfoyle

The objective of this project is to continue developing
more efficient tools and methods addressing the
"front-end" processes of large-scale DNA sequencing. Our
specific aims are high-throughput purification and mapping
of cosmid inserts, controlled fragmentation of random
inserts, direct selection vectors for cloning and sequencing,
high-throughput M13 clone isolations, and
high-throughput template purifications.

An approach to multi-cosmid purifications was developed
using a cell harvester and binding to GF/C glass fiber
filter-bottom microtiter plates. This method proved inadequate
because the yields were low and the DNA was easily
fragmented. In the last year we have started examining
the use of triplex-affinity capture (TAC) for this purpose as
applied to BACs, based on our previous success with TAC
purification and restriction mapping of cosmids (1,2).

We initially proposed to control random fragmentation for
shotgun cloning using CviJI and its methyltransferase.
Instead, we are now exploring automating it by scaled-down
nebulization and parallel processing.

We have made a vector, M13-102 (3,4; patented), for facilitating
construction and improving the quality of M13 shotgun
libraries. It allows direct selection of recombinants and
dephosphorylation of inserts to reduce chimeras, and it contains
universal primers for fluorescent sequencing and a
triplex sequence for easy TAC purification of linearized
RF DNA. We also made a version of this vector,
M13-100Z, which expresses the alpha-peptide of β-gal. Its
utility is in flow-cytometry-based clone isolation. We continue
to develop these vectors for multiple cloning sites
and for insert flipping, for use in the closing steps of
large-scale sequencing projects.

We continue to develop high-throughput clone isolations
by flow cytometric cell sorting. M13 or plasmid clones can
theoretically be isolated in microtiter wells at rates
up to 2 per second using our present FacStar-Plus cytometer
and collection assembly. Theoretical rates are much
higher. This bypasses plating onto solid media and any
need for plaque/colony picking. We initially tried isolations
after microencapsulation of cells in agarose gel
microbeads, but with hardware and software improvements we can
now distinguish positively selected transfected cells from
background. Efficiency of sorting is very sensitive to detection
efficiency. We continue to investigate different
methods of fluorescence detection for various plasmid and
M13 vector systems, including fluorogenic substrates for
β-gal, fluorescent-tagged antibodies to M13 or cell surface
proteins, and green fluorescent protein as a reporter.

We have been developing a solid-phase filter plate method
for M13 template purifications using carboxylated polystyrene
beads (Bangs Labs, IN) for automating on the
Hamilton 2200. It should process 96 samples in under 30
minutes and deliver 1-2 micrograms per sample for
cycle sequencing. This approach has proven superior to
others we have tried with respect to amenability to automation
(5,6).

Ancillary projects. We reported a method for direct fluorescence
analysis of genetic polymorphisms using oligonucleotide
arrays on glass supports (7), which spun off
other projects including (a) enhanced discrimination by
artificial mismatch hybridization (8), (b) restriction hybridization
ordering of shotgun clones, and (c) restriction site
indexing-PCR (RSI-PCR) (9, patent applied for). RSI-PCR
is an alternative strategy to extra-long PCR which has
application in large gap filling (>45 kb), differential
gene expression analysis, RFLP and EST marker production,
end-sequencing, and others.

Our most significant findings are the following:

1. Improved direct selection M13 cloning vector

2.  Rapid  restriction  mapping  of  cosmids  using 
triple-helix  affinity  capture 

3.  High-throughput  M13  template  production  using  car- 
boxylated  beads 

4.  Sequencing  of  a  cosmid  encoding  the  Drosophila 
GABA  receptor 

5.  Improved  detection  of  sequencing  clones  by 
flow-cytometry 

6. RSI-PCR, a strategy to obtain mapped and
sequence-ready DNA directly from up to 0.5 kb regions
of a complex genome using palindromic class II
restriction enzymes; bypasses conventional cloning
methodology (see previous section for applications).

DOE Grant No. DE-FG02-91ER61122.

High-Speed DNA Sequence Analysis by
Matrix-Assisted Laser Desorption
Mass Spectrometry

Our mass spec research has focused primarily on the possibility
of utilizing Matrix-Assisted Laser Desorption/Ionization
Mass Spectrometry (MALDI-MS) as an alternative
method to conventional gel electrophoresis for DNA sequence
analysis. In this approach, extension fragments generated
by the Sanger sequencing reactions are separated by
size and detected in the mass spectrometer in one step.

Our group has shown fragmentation to be a major factor
limiting accessible mass range, sensitivity, and mass resolution
in the analysis of DNA by MALDI-MS. This DNA
fragmentation was shown to be strongly dependent on both
the MALDI matrix and the nucleic acid sequence employed.
Fragmentation is proposed to follow a pathway in
which nucleobase protonation leads to cleavage of the
N-glycosidic bond with base loss, followed by cleavage of
the phosphodiester backbone. Modifications of the deoxyribose
sugar ring by replacing the 2' hydrogen with more
electron-withdrawing groups such as the hydroxyl or
fluoro group were shown to stabilize the N-glycosidic
bond, partially or completely blocking fragmentation at the
modified nucleosides. The stabilization provided by these
chemical modifications was also shown to expand the
range of matrices useful for nucleic acid analysis, yielding
in some cases greatly improved performance.

DOE Grant No. DE-FG02-91ER61130.

Analysis of Oligonucleotide Mixtures

This project aims to develop electrospray ionization mass
spectrometry (ESI-MS) methods for high-speed DNA sequencing
of oligonucleotide mixtures that can be integrated
into an effective overall sequencing strategy. A second
goal is to develop mass spectrometric methods that can
be effectively utilized in post-genomic research in broad areas
of DNA characterization, such as with polymerase
chain reaction to rapidly and accurately identify single-base
polymorphisms. ESI produces intact molecular ions
from DNA fragments of different size and sequence with
high efficiency [1]. Our aim is to determine ESI mass
spectrometry conditions that are compatible with biological
sample preparation, to allow efficient ionization of
DNA and the analysis of complex mixtures
(e.g., a Sanger sequencing ladder). We have developed a
novel on-line microdialysis method at PNNL to remove
salts, detergents, and buffers from such biological preparations
as PCR and dideoxy sequencing mixtures. This has
allowed rapid and efficient desalting (e.g., of samples
having 0.25 M NaCl), permitting ESI mass spectral analysis
without the typically problematic Na adducts.
Oligonucleotide ions are typically produced from ESI with
a broad distribution of net charge states for each molecular
species, which leads to difficulties in the analysis of complex
mixtures [1]. To make identification of each component
in a sequencing mixture possible, the charge states of
molecular ions can be reduced using gas-phase reactions.
The charge-state reduction methods being examined include:
(1) reactions with organic acids and bases (in the
solution to be electrosprayed and the ESI-MS interface or
the gas phase); (2) the labeling of the oligonucleotides
with a designed functional group for production of molecular
ions of very low charge states; and (3) the shielding
of potential charge sites on the oligonucleotide phosphate/
phosphodiester groups with polyamines (and the subsequent
gas-phase removal of the neutral amines). In initial
studies two methods for charge-state reduction of gas-phase
oligonucleotide negative ions have been tested: (1)
the addition of acids and bases to the oligonucleotide solution
and (2) the formation of diamine adducts followed by
dissociation in the interface region [2,3]. Several methods
show promise for charge-state reduction, and results have
been demonstrated for series of smaller oligonucleotides.
We have recently demonstrated for the first time that PCR
products can be rapidly detected using ESI-MS, with significant
improvements projected [4,5]. Finally, new mass
spectrometric methods have been developed to provide the
dynamic range expansion necessary for addressing DNA
sequencing mixtures [6]. Our overall aim is to provide a
foundation for the development of an overall approach to
high-speed sequencing (including rapid and precise
PCR product characterization) using cost-effective
high-throughput instrumentation.
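
The charge-state problem described in this abstract can be made concrete with the standard relation for negative-ion electrospray, m/z = (M - z·m_H)/z, for an ion that has lost z protons. The sketch below is illustrative only; the two neutral masses and the charge ranges are made-up values, not data from this project.

```python
PROTON_MASS = 1.00728  # Da

def mz_negative(neutral_mass: float, z: int) -> float:
    """m/z of an [M - zH]^z- ion."""
    return (neutral_mass - z * PROTON_MASS) / z

def peak_list(masses, charges):
    """All (m/z, mass, z) peaks expected for a mixture, sorted by m/z."""
    peaks = [(mz_negative(m, z), m, z) for m in masses for z in charges]
    return sorted(peaks)

mixture = [6000.0, 6300.0]  # hypothetical neutral masses in Da

broad = peak_list(mixture, range(3, 9))  # several charge states per species
reduced = peak_list(mixture, [1])        # after charge-state reduction

print(len(broad), "peaks before reduction,", len(reduced), "after")
for mz, mass, z in reduced:
    print(f"M = {mass:.0f} Da, z = {z}: m/z = {mz:.2f}")
```

With a broad distribution, every species contributes a whole series of peaks and the series of different species interleave; after reduction to a single low charge state, each component of the mixture gives one peak.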

This project is aimed at the development of a totally new
concept for high-speed DNA sequencing based upon the
analysis of single (i.e., individual) large DNA fragments
using electrospray ionization (ESI) combined with Fourier
transform ion cyclotron resonance (FTICR) mass spectrometry.
In our approach, large single-stranded DNA segments
extending to as much as 25 kilobases (and possibly
much larger) are transferred to the gas phase using ESI.
The multiply charged molecular ions are trapped in the
cell of an FTICR mass spectrometer, where one or more
single ions are then selected for analysis, in which the
mass-to-charge ratio (m/z) of each is measured both rapidly and
non-destructively. Single-ion detection is achievable due to
the high charge state of the electrosprayed ions and the
unique sensitivity of new FTICR detection methodologies.

Initial efforts under this project have demonstrated the capability
for the formation, extended trapping, isolation,
and monitoring of sequential reactions of highly charged
DNA molecular ions with molecular weights well into the
megadalton range [1-6]. We have shown that large
multiply charged individual ions of both single- and
double-stranded DNA anions can also be efficiently
trapped in an FTICR cell, and their mass-to-charge ratios
measured with very high accuracy. Thus, it is feasible to
quickly determine the mass of each lost unit as the DNA is
subjected to rapid reactive degradation steps. One approach
is to develop methods based upon the use of
ion-molecule or photochemical processes that can promote
a stepwise reactive degradation of gas-phase DNA anions.
Successful development of one of these approaches could
greatly reduce the cost and enhance the speed of DNA sequencing,
potentially allowing for sequencing DNA segments
of more than 25 kilobases in length on a time scale
of minutes with negligible error rates, with the added potential
for conducting many such measurements in parallel.
Instrumentation optimized for these purposes is currently
being introduced and promises to greatly advance the
methodology. The techniques being developed promise to
lead to a host of new methods for DNA characterization,
potentially extending to the size of much larger DNA restriction
fragments (>500 kilobases).
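
Reading sequence from "the mass of each lost unit" can be sketched as follows. This is an illustration under assumptions not stated in the abstract: each degradation step removes exactly one nucleotide residue, the neutral mass has already been derived from the measured m/z and charge, and the residue masses are standard average values for DNA.

```python
# Average masses (Da) of deoxynucleotide monophosphate residues in a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}

def call_base(mass_loss: float, tolerance: float = 0.5) -> str:
    """Return the base whose residue mass best matches the loss, or 'N'."""
    base, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - mass_loss))
    return base if abs(mass - mass_loss) <= tolerance else "N"

def read_from_masses(masses: list[float]) -> str:
    """Call one base per step from successive neutral masses."""
    return "".join(call_base(a - b) for a, b in zip(masses, masses[1:]))

# Hypothetical neutral masses after successive losses of G, A, T and C.
masses = [10000.00]
for base in "GATC":
    masses.append(masses[-1] - RESIDUE_MASS[base])

print(read_from_masses(masses))
```

The smallest difference between two residues is about 9 Da (A versus T). On a single strand of 25 kilobases, which weighs several megadaltons, resolving that difference requires mass accuracy on the order of one part per million, which is why the very high accuracy of the FTICR measurement matters.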

Our studies are directed towards improving the properties
of DNA polymerases for use in DNA sequencing. The primary
focus is understanding the mechanism by which
DNA polymerases discriminate against nucleotide analogs,
and the mechanism by which they incorporate nucleotides
processively without dissociating from the DNA template.

We are comparing three DNA polymerases that have been
used extensively for DNA sequencing: E. coli DNA polymerase
I, T7 DNA polymerase, and Taq DNA polymerase.
These are related to one another, and this homology has
been exploited to construct active site hybrids that have
been used to determine the structural basis for differences
in their activities. Specifically, the hybrids have been used
(1) to determine why E. coli DNA polymerase I and Taq
DNA polymerase discriminate strongly against
dideoxynucleotides, and (2) to understand how T7 DNA
polymerase interacts with its processivity factor,
thioredoxin, to confer high processivity.

Based on these studies, we have been able to modify Taq
DNA polymerase and E. coli DNA polymerase I to make
them incorporate dideoxynucleotides much more efficiently,
and to have increased processivity in the presence
of thioredoxin. The ability to incorporate
dideoxynucleotides efficiently greatly improves the uniformity
of band intensities on a DNA sequencing gel, thereby
increasing the accuracy of the DNA sequence obtained. In
addition, the efficient use of dideoxynucleotides reduces
the amount of these analogs required for DNA sequencing,
an important issue when using fluorescently modified
dideoxy terminators. In an approach that complements
these studies, we, in collaboration with Dr. Thomas
Ellenberger (Harvard Medical School), are determining the
crystal structure of T7 DNA polymerase in a complex with
thioredoxin and a primer-template. Knowledge of this
structure will allow the rational design of specific mutations
that will enable DNA polymerases to incorporate
other analogs useful for DNA sequencing more efficiently,
such as those with fluorescent moieties on the bases.

We are developing molecular approaches to DNA sequencing
that enable primer walking without the step of chemical
synthesis of oligonucleotide primers between the walks.
One such approach involves "modular primers," described
earlier, consisting of 5-mers, 6-mers, or 7-mers (selected
from a presynthesized library) annealing to the template
contiguously with each other. Another approach, which we
have termed DENS (Differential Extension with Nucleotide
Subsets), works by selectively extending a short
primer, making it a long one at the intended site only.
DENS starts with a limited initial extension of the primer
(at 20-30 °C) in the presence of only 2 of the 4 possible
dNTPs. The primer is extended by 6-9 bases or longer at
the intended priming site, which is deliberately selected
(as is the two-dNTP set) to maximize the extension
length. The subsequent sequencing/termination reaction at
60-65 °C then accepts the extended primer at the intended
site, but not at alternative sites, where the initial extension
(if any) is generally much shorter. DENS allows the use of
primers as long as 8-mers (degenerate in 2 positions),
which prime much more strongly than modular primers
involving 5-7-mers and which (unlike the latter) can be
used with thermostable polymerases, thus allowing
cycle sequencing with dye terminators for Taq, as well as
making double-stranded DNA sequencing more robust.
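
The choice of priming site and two-dNTP set described in this abstract amounts to a small search problem. The sketch below is a toy model, not the authors' procedure: it assumes the initial extension simply continues until the template calls for a nucleotide outside the supplied pair, and the sequence is invented.

```python
from itertools import combinations

def extension_length(new_strand: str, dntps: frozenset) -> int:
    """Bases added before the polymerase needs a dNTP that is absent.

    `new_strand` is the sequence to be synthesized (5'->3') immediately
    downstream of the primer's 3' end.
    """
    n = 0
    for base in new_strand:
        if base not in dntps:
            break
        n += 1
    return n

def best_pair(new_strand: str):
    """Two-dNTP subset giving the longest initial extension."""
    pairs = [frozenset(p) for p in combinations("ACGT", 2)]
    best = max(pairs, key=lambda p: extension_length(new_strand, p))
    return "".join(sorted(best)), extension_length(new_strand, best)

downstream = "AGGAAGAGCTTACG"  # invented sequence following the primer
print(best_pair(downstream))
```

For this invented site, supplying only dATP and dGTP extends the primer by 8 bases before a C is required. Running the same function over candidate sites would rank them by achievable extension length, which is the selection criterion the abstract describes.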

These technologies are expected to speed up genome sequencing
in more than one way:

a) A reduction in redundancy would result from more efficient
and rapid closure of even long gaps, which are currently
avoided at the price of 7- to 9-fold redundancy in
shotgun sequencing. Instantly available primers would also improve
the quality of sequencing. Stretches of sequence that have
too low a confidence level (high suspected error rate) can be
resequenced without synthesizing new oligos and without
growing any new subclones.

b) Further down the road, the completion of the automation
of the closed cycle of primer walking will be made
possible by eliminating the need to synthesize the
walking primers. Combined with capillary sequencers,
the instant availability of the walking primers should reduce
the time per walking cycle from 2-3 days now to
about 1.5-2.0 hours, an improvement in speed by a factor
of 20-50.

c) The closed-end automation would minimize both
labor cost and human error. As primer walking has minimal,
if any, of the front-end and back-end bottlenecks inherent to
shotgun sequencing, the cost of sequencing would be essentially that
of reagents, 5 cents/base or less.
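
The speed-up quoted in item (b) can be checked with simple arithmetic. The sketch below only restates the figures given there; pairing the shortest current cycle with the longest projected one, and the reverse, is an assumption about how the range was derived.

```python
HOURS_PER_DAY = 24

current_cycle_h = (2 * HOURS_PER_DAY, 3 * HOURS_PER_DAY)  # 2-3 days
projected_cycle_h = (1.5, 2.0)                            # 1.5-2.0 hours

lowest = current_cycle_h[0] / projected_cycle_h[1]   # 48 h / 2.0 h
highest = current_cycle_h[1] / projected_cycle_h[0]  # 72 h / 1.5 h

print(f"speed-up between {lowest:.0f}x and {highest:.0f}x")
```

The result, 24-fold to 48-fold, is consistent with the stated factor of 20-50.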

There are three potential roles for mass spectrometry relevant
to the Human Genome Project:

a) The most obvious role is that on which all groups have
been focussing: development of an alternative, faster sequence
ladder readout method to speed up large-scale sequencing.
Progress here has been difficult and slow because
the mass spectrometry requirements exceed the current
capabilities of mass spectrometry even for proteins,
and DNA presents significantly more difficulty than proteins.
We have shown previously that pulsed laser ablation
of DNA from frozen aqueous films has the potential to
yield sequence-quality mass spectra, but that ionization in
this approach is erratic and uncontrollable. We are focussing
on developing ionization methods using ion (or electron)
attachment to vapor-phase DNA (ablated from ice
films) in an electric field-free environment; results of this
approach will be reported.

b) Mass spectrometry may not ultimately compete favorably
in speed with large-scale multiplexing of conventional
or near-term technologies such as capillary electrophoresis.
However, as the Genome Project nears completion
there will be an increasing need for rapid small-scale
DNA analysis, where the multiplex advantage will not be
so great and mass spectrometry could play a more significant
role. With this in mind we are looking at ways to
speed up the overall mass spectrometric analysis, e.g.,
simple rapid cleanup of sequence mixtures, and at generation
of short sequence ladders by exopeptidase digestion.

c) Given the genome database(s) at the completion of the
project, with rapid search capability, a need will arise for
comparably rapid generation of search input data to identify
often very small quantities of proteins isolated from
biochemical investigations. With this in mind we have developed
extremely rapid enzyme digestion techniques optimized
for mass spectrometric readout, using endopeptidases
covalently coupled directly to the mass spectrometer
probe tip. The elimination of autolysis and transfer losses
allows rapid (few-minute) endopeptidase digestion and
mass analysis of as little as 1 picomole of protein, leading
to an unambiguous database identification. An alternative
search procedure uses partial amino-acid sequence information.
With the added use of exopeptidases to generate a
peptide ladder sequence in the mass spectrum of the endopeptidase
digest, on the order of a dozen residues of internal
sequence can be generated in a total analysis time of
20 minutes or less, again using only picomoles of sample.

DOE Grant No. DE-FG02-91ER61127.

We have developed novel separation, detection, and imaging
techniques for real-time monitoring in capillary electrophoresis.
These techniques will be used to substantially
increase the speed, throughput, reliability, and sensitivity
in DNA sequencing applications in highly multiplexed
capillary arrays. We estimate that it should be possible to
eventually achieve a raw sequencing rate of 40 million
bases per day in one instrument based on the standard
Sanger protocol. We have reached a stage where an actual
sequencing instrument with 100 capillaries can be built to
replace the Applied Biosystems 373 or 377 instruments,
with a net gain in speed and throughput of 100-fold and
24-fold, respectively.

The substantial increase in sequencing rate is a result of
several technical advances in our laboratory. (1) The use of
commercial linear polymers for sieving allows replaceable
yet reproducible matrices to be prepared that have lower
viscosity (thus faster migration rates) compared to polyacrylamide.
(2) The use of a charge-injection device camera
allows random data acquisition to decrease data storage and
data transfer time. (3) The use of distinct excitation wavelengths
and cut-off emission filters allows maximum light
throughput for efficient excitation and sensitive detection
employing the standard 4-dye coding. (4) The use of
index matching and 1:1 imaging reduces stray light without
sacrificing the convenience of on-column detection.


Continuing efforts include further optimization of the
separation matrix, development of new column conditioning
protocols, refinement of the excitation/emission optics,
design of a pressure injection system for 96-well titer
plates, validation of a new 2-color base-calling scheme,
simplification of software to allow essentially real-time
data processing, implementation of voltage programming
to shorten the total run times, and scale-up of the technology
to allow parallel sequencing in up to 1,000 capillaries.

Our work for 1996 and 1997 will include the following:

1. Comparative study of the kinetics of entry of DNA of
different molecular forms into E. coli DH10B/r and
DH5α cells during electrotransformation. Study of the optimal
regimes of cell-wall permeabilization for DH10B/r cells.

2. Study of the efficiency of BAC cloning in DH10B/r
cells using the new electrotransformation method. Optimization
of the procedure for DH10B/r cells.

3. Modernization of the electronic equipment in accordance
with results of the biological experiments. To expand
the studies, we need to extend the capability of the
instrumentation to increase its flexibility and to improve
the accuracy and reproducibility of the electric fields we
generate by incorporating electronic components with
higher tolerances.

DOE Grant No. OR00033-93CIS015.

Overcoming Genome Mapping
Bottlenecks

Charles  R.  Cantor 

Center  for  Advanced  Biotechnology;  Boston  University; 

Boston  MA  02215 

617/353-8500,  Fax:  8501,  crc@enga.bu.edu 

http://eng.bu.edu/CAB 

Most traditional DNA analysis is done based on fractionation
of DNA by length. We have, instead, begun to explore
the use of DNA sequences as capture and detection
methods to expedite a number of procedures in genome
analysis.

Triplet repeats like (GGC)n are an important class of human
genetic markers, and they are also responsible for a
number of inherited diseases involving the central nervous
system. For both of these reasons it would be very useful
to have a way to monitor the status of large numbers of
triplet repeats simultaneously. We are developing methods
to isolate and profile classes of such repeats.

In one method, genomic DNA is cut with one or more restriction
nucleases, and splints are ligated onto the ends of
the fragments. Then fragments containing a specific class
of repeats are isolated by capture on magnetic microbeads
containing an immobilized simple repeating sequence. The
desired material is then released, and, if necessary, a selective
PCR is done to reduce the complexity of the sample.
Otherwise the entire captured sample is amplified by PCR.
The spectrum of repeats is then examined by electrophoresis
on an automated fluorescent gel reader. In our case the
Pharmacia ALF is used because of its excellent quantitative
signal accuracy. A very complex spectrum of bands is

*Projects designated by an asterisk received small emergency grants following December 1992 site reviews by David Galas (formerly DOE Office of
Health and Environmental Research, which was renamed Office of Biological and Environmental Research in 1997), Raymond Gesteland (University
of Utah), and Elbert Branscomb (Lawrence Livermore National Laboratory).


Resolving  Proteins  Bound  to  Individual 
DNA  Molecules 

David Allison, Bruce Warmack, Mitch Doktycz, Tom
Thundat, and Peter Hoyt

Molecular  Imaging  Group;  Health  Sciences  Research 

Division;  Oak  Ridge  National  Laboratory;  Oak  Ridge,  TN 

37831-6123 

Allison: 423/574-6199, Fax: -6210, allisondp@ornl.gov

Warmack: 423/574-6202, Fax: -6210, rjw@ornl.gov

We have precisely located sequence-specific proteins
bound to individual DNA molecules by direct AFM
imaging. Using a mutant EcoRI endonuclease that
site-specifically binds but doesn't cleave DNA, bound enzyme has
been imaged and located, with an accuracy of ±1%, on
well-characterized plasmids and bacteriophage lambda
DNA (48 kb). Cosmids have been mapped and, by
incorporating methods for anchoring molecules to surfaces and
straightening to prevent molecular entanglement,
BAC-sized clones could be analyzed.
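
As a rough illustration of what a ±1% positional accuracy means in base pairs, the following sketch converts a fractional position measured along an imaged molecule into a sequence coordinate with an uncertainty window. The function name and the assumption that contour length scales linearly with sequence length are illustrative only and are not taken from the project.

```python
def site_window_bp(fraction, length_bp, rel_error=0.01):
    """Return (low, centre, high) base-pair estimates for a bound protein.

    fraction  -- measured position as a fraction of contour length (0..1)
    length_bp -- known length of the molecule in base pairs
    rel_error -- positional accuracy as a fraction of total length
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    centre = fraction * length_bp
    half_width = rel_error * length_bp
    low = max(0.0, centre - half_width)
    high = min(float(length_bp), centre + half_width)
    return round(low), round(centre), round(high)


# Lambda DNA is given above as 48 kb, so a site imaged 40% of the way
# along the molecule is located to within about 480 bp on either side.
print(site_window_bp(0.40, 48_000))
```

On this assumption, the same ±1% figure corresponds to a window of only a few tens of base pairs on a small plasmid, and to more than a kilobase on a BAC-sized clone.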

This direct imaging approach could be rapidly developed
to locate other sequence-specific proteins on genomic
clones. Enzymatic proteins, involved in identifying and
repairing damaged or mutated regions on DNA molecules,
could be imaged bound to lesion sites. Transcription factor
proteins that identify gene-start regions and other
regulatory proteins that modulate the expression of genes by
binding to specific control sequences on DNA molecules
could be precisely located on intact cloned DNAs.

Conventional gel-based techniques for identifying
site-specific protein binding sites must rely upon fragment
analysis for identifying restriction enzyme sites, or, for
non-cutting proteins, upon gel-shift methods that can only
address small DNA fragments. Conversely, AFM imaging is
a general approach that is applicable to the analysis of all
site-specific DNA-protein interactions on large-insert clones.
This technique could be developed for high-throughput
analysis, can be accomplished by technicians, uses readily
available, relatively inexpensive instrumentation, and should
be a technology fully transferable to most laboratories.

DOE Contract No. DE-AC05-84OR21400.


*Improved Cell Electrotransformation
by Macromolecules

Alexandre S. Boitsov, Boris V. Oskin, Anton O. Reshetin,
and Stepan A. Boitsov
Department of Biophysics; St. Petersburg State Technical
University; 195251 St. Petersburg, Russia
+7-812/277-5959, Fax: /247-2088 or /534-3314,

sasha@biopKhop.stu.neva.ru

seen representing hundreds of DNA fragments. We have
shown that this spectrum is dramatically different with
DNAs from unrelated individuals, and the spectrum is
markedly dependent on the choice of restriction enzyme,
as expected. Repeated measurements on the same sample
are highly reproducible. The ability of the method to detect
a specific altered repeat length in a complex DNA sample
has been validated by examining several individuals with
normal or expanded repeat sequences in the Huntington's
disease gene. One very powerful application of this
method may be the analysis of potential DNA differences
in monozygotic twins discordant for a genetic disease.
This method can be used to capture genome subsets
containing any interspersed repeat. It will also detect
insertions and deletions near such repeats. Methylation
differences between sensitive methylation samples are also
detectable when restriction fragments are used.

Conventional analysis of triplet repeats is very laborious
since individual repeats must be analyzed by electrophoresis
on DNA sequencing gels. The decrease in effort for
such analyses will scale linearly with the number of repeats
that can be analyzed simultaneously, so we are potentially
looking at something like a factor of 100 improvement if
the above scheme under development can be effectively
realized.

As an alternative approach, we are developing chip-based
methods that can detect the length of a tandemly repeating
sequence without any need for gel electrophoresis. Here
the goal is to build an array of all possible repeat sequence
lengths flanked by single-copy DNA. When an actual
sample is hybridized to such an array, the specific alleles
in the sample will produce perfect duplexes at their
corresponding points in the array and mismatched duplexes
elsewhere. Thus, the task of scoring the repeat lengths is
reduced to the task of distinguishing perfect and imperfect
duplexes. Currently we are exploring a number of different
enzymatic protocols that offer the promise of making such
distinctions reliably.
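
The scoring logic of such an array can be sketched in a few lines. This is only a toy model under stated assumptions: exact string equality stands in for a perfect duplex, the flanking sequences are made up, and the CAG unit is chosen simply because a triplet repeat is discussed above. The enzymatic discrimination step itself is not modeled.

```python
def build_array(left_flank, unit, right_flank, max_repeats):
    """One probe per possible repeat count, keyed by that count."""
    return {n: left_flank + unit * n + right_flank
            for n in range(1, max_repeats + 1)}


def score_alleles(sample_alleles, array):
    """Return the repeat counts whose probes match a sample allele exactly.

    Exact equality stands in for a perfect duplex; every other
    probe/allele pairing would form a mismatched duplex.
    """
    return sorted(n for n, probe in array.items() if probe in sample_alleles)


LEFT, RIGHT = "GATTACCGT", "TTGCAAGCC"   # hypothetical single-copy flanks
array = build_array(LEFT, "CAG", RIGHT, max_repeats=60)

# A heterozygous sample carrying 19 and 44 copies of the repeat.
sample = {LEFT + "CAG" * 19 + RIGHT, LEFT + "CAG" * 44 + RIGHT}
print(score_alleles(sample, array))   # [19, 44]
```

Reading out the alleles then amounts to reporting which array positions gave a perfect match, with no sizing on a gel.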

In other work we are using enzyme-enhanced sequencing
by hybridization (SBH) as a device for the rapid
preparation of DNA samples for mass spectrometry. For example,
partially duplex DNA probes can capture and generate
sequence ladders from any arbitrary DNA sequence. Current
MALDI protocols allow sequence to be read to lengths of
50 to 60 bases. While this is probably insufficient for most
de novo DNA sequencing, it is an extremely promising
approach for comparative or diagnostic DNA sequencing.

DOE Grant No. DE-FG02-93ER61609.


Preparation  of  PAC  Libraries 

Joe Catanese, Baohui Zhao, Eirik Frengen, Chenyan Wu,
Xiaoping Guan, Chira Chen, Eugenia Pietrzak,
Panayotis A. Ioannou,1 Julie Korenberg,2 Joel Jessee,3 and
Pieter J. de Jong
Department of Human Genetics; Roswell Park Cancer
Institute; Buffalo, NY 14263
de Jong: 716/845-3168, Fax: -8849
pieter@dejong.med.buffalo.edu
http://bacpac.med.buffalo.edu
1 The Cyprus Institute of Neurology and Genetics; Nicosia,
Cyprus
2 Cedars Sinai Medical Center; Los Angeles, CA 90048
3 Life Technologies; Gaithersburg, MD 20898

Recently, we have developed procedures for the cloning of
large DNA fragments using a bacteriophage P1-derived
vector, pCYPAC1 (Ioannou et al. (1994), Nature Genetics
6: 84-89). A slightly modified vector (pCYPAC2) has now
been used to create a 15-fold redundant PAC library of the
human genome, arrayed in more than 1,000 384-well
dishes. DNA was obtained from blood lymphocytes from a
male donor. The library was prepared in four distinct
sections designated as RPCI-1, RPCI-3, RPCI-4 and RPCI-5,
respectively, each having 120 kbp average inserts. The
RPCI-1 segment of the library (3X; 120,000 clones,
including 25% non-recombinant) has been distributed to
over 40 genome centers worldwide and has been used in
many physical mapping studies, positional cloning efforts
and in various large-scale DNA sequencing enterprises.
Screening of the RPCI-1 library by numerous markers
results in an average of 3 positive PACs per
autosome-derived probe or STS marker. In situ hybridization results
with 250 PAC clones indicate that chimerism is low or
nonexistent. Distribution of RPCI-3 (3X; 78,000 clones,
less than 1% non-recombinants, 4% empty wells) is now
underway and the further RPCI-4 and -5 segments (< 5%
empty wells) will be distributed upon request. To facilitate
screening of the PAC library, we have provided the RPCI-1
PAC library to several screening companies and
noncommercial resource centers. In addition, we are now distributing
high-density colony membranes at cost-recovery price,
mainly to groups having a copy of the PAC library. The
combined RPCI-1 and -3 segments (6X) can be
represented on 11 colony filters of 22x22 cm, using duplicate
colonies for each clone. We are currently generating a
similar PAC library from the 129 mouse strain.
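
The redundancy figures quoted for the individual segments can be checked with back-of-envelope arithmetic: genome coverage is the number of recombinant clones times the average insert size, divided by the genome size. The sketch below assumes a haploid genome of about 3,000 Mb (a figure not stated in the abstract) and uses a simple Poisson model for the chance that a given locus is present in at least one clone; the function names are illustrative.

```python
import math

GENOME_KB = 3_000_000  # assumed haploid human genome size (~3,000 Mb)


def coverage(clones, insert_kb, nonrecombinant=0.0, genome_kb=GENOME_KB):
    """Genome equivalents represented by the recombinant clones."""
    return clones * (1.0 - nonrecombinant) * insert_kb / genome_kb


def p_locus_present(cov):
    """Poisson estimate that a given locus lies in at least one clone."""
    return 1.0 - math.exp(-cov)


# Clone counts, insert size and non-recombinant fractions as quoted above
# (1% is used as an upper bound for RPCI-3).
rpci1 = coverage(120_000, 120, nonrecombinant=0.25)
rpci3 = coverage(78_000, 120, nonrecombinant=0.01)
for name, cov in (("RPCI-1", rpci1), ("RPCI-3", rpci3)):
    print(f"{name}: {cov:.1f}X, P(locus present) = {p_locus_present(cov):.2f}")
```

Under these assumptions the two segments come out at about 3.6 and 3.1 genome equivalents, in line with the nominal 3X quoted for each and with the observed average of 3 positive PACs per probe.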

To facilitate the additional use of large-insert bacterial
clones for functional studies, we have prepared new PAC
& BAC vectors with a dominant selectable marker gene
(the blasticidin gene under control of the beta-actin
promoter), an EBV replicon and an "update feature". This
feature utilizes the specificity of Transposon Tn7 for the Tn7att
sequence (in the new PAC and BAC vectors) to transpose
marker genes, other replicons and other sequences into PACs
or BACs. Hence, it facilitates retrofitting existing PAC/
BAC clones (made with the new vectors) with desirable
sequences without affecting the inserts. The new vector(s)
are being applied to generate second generation libraries
for human (female donor), mouse and rat.

DOE Grant No. DE-FG02-94ER61883 and NIH Grant No.
1R01RG01165.

Development  of  Affinity  Technology  for 
Isolating  Individual  Human 
Chromosomes  by  Third-Strand 
Binding 

Jacques R. Fresco and Marion D. Johnson III
Department of Molecular Biology; Princeton University;
Princeton, NJ 08544-1011
609/258-3927, Fax: -6730
esteckman@molbiol.princeton.edu
http://molbiol.princeton.edu

Prior to the onset of this grant, solution conditions had
been developed for binding a 17-residue third strand
oligodeoxyribonucleotide probe to a specific human
chromosome (HC) 17 multicopy alpha satellite target sequence
cloned into DNA vectors of varying size up to 50 kb.
Binding was shown to be both highly efficient and
specific. Moreover, initial experiments with fluorescent-labeled
third strands and human lymphocyte metaphase
spreads and interphase nuclei proved similarly successful.
During the current research period, the technology for such
third strand-based cytogenetic examination, i.e., Triplex In
Situ Hybridization or TISH, of such spreads was perfected,
so that it is now a highly reproducible method.
Comparison of spreads of different individuals by TISH and FISH
analysis has provided a new basis for detecting alpha
satellite DNA polymorphisms, the basis of which requires
further investigation.

This year work also commenced on the development of
comparable probes specific for alpha satellite sequences in
HC-X, 11, and 16. The work with HC-X has reached the
stage where we are ready to test the probe for TISH-based
cytogenetic analysis. Solution studies of the interaction of
the probes designed for HC-11 and HC-16 alpha satellite
targets are following the well-established path we
employed for HC-17 and HC-X. With the expectation of
success in these cases during the coming year, the way should
be clear for the development and application of
comparable probes for alpha satellite sequences of any other
human chromosomes that may be of interest, and possibly of
other eukaryotic species.

Meanwhile, we have begun to turn our attention to two
other goals, one being the exploitation of our probes for
the isolation of individual human chromosomes by affinity
purification, as we originally proposed. The other goal is
to exploit our probes as aids in flow sorting human
chromosomes, a direction of work we expect to pursue in
collaboration with the Los Alamos National Laboratory, just
as soon as they indicate a readiness to do so. Finally, we
have begun to evaluate the possibility of using third-strand
binding fluorescent probes for detection of single copy
genes by means of photon counting, a goal which we plan
to undertake with our colleague Robert Austin of our
Physics Department.

DOE Grant No. DE-FG02-96ER62202.

Chromosome  Region-Specific  Libraries 
for  Human  Genome  Analysis 

Fa-Ten  Kao 

Eleanor  Roosevelt  Institute  for  Cancer  Research;  Denver, 

CO  80206 

303/333-4515, Fax: -8423, kao@eri.uchsc.edu

The objective of this project is to construct and
characterize chromosome region-specific libraries as resources for
genome analysis. We have used our chromosome
microdissection and MboI linker-adaptor technique (PNAS 88,
1844, 1991) to construct region-specific libraries for
human chromosome 2 and other chromosomes. The libraries
have been critically evaluated for high quality, including
insert size, proportion of unique vs repetitive sequence
microclones, percentage of microclones derived from
dissected region, etc.

We have constructed and characterized 11 region-specific
libraries for the entire human chromosome 2 (the second
largest human chromosome with 243 Mb of DNA),
including 4 libraries for the short arm and 6 libraries for the long
arm, plus a library for the centromere region. The libraries
are large, containing hundreds of thousands of microclones
in plasmid vector pUC19, with a mean insert size of 200
bp. About 40-60% of the microclones contain unique
sequences, and between 70-90% of the microclones were
derived from the dissected region. In addition, we have
isolated and characterized many unique sequence
microclones from each library that can be readily
sequenced as STSs, or used in isolating other clones with
large inserts (like YAC, BAC, PAC, P1 or cosmid) for
contig assembly. These libraries have been used
successfully for high resolution physical mapping and for
positional cloning of disease-related genes assigned to these
regions, e.g. the cloning of the gene for hereditary
nonpolyposis colorectal cancer (Cell 75, 1215, 1993).

For each library, we have established a plasmid sub-library
containing at least 20,000 independent microclones. These
sub-libraries have been deposited to ATCC for permanent
maintenance and general distribution. The ATCC
Repository numbers for these libraries are: #87188 for 2P1 library
(region 2p23-p25, comprising 25 Mb); #87189 for 2P2
library (2p21-p23, 28 Mb); #87103 for 2P3 library
(2p14-p16, 22 Mb); #87104 for 2P4 library (2p11-p13, 28
Mb); #77419 for 2Q1 library (2q35-q37, 28 Mb); #87308
for 2Q2 library (2q33-q35, 24 Mb); #87309 for 2Q3
library (2q31-q32, 26 Mb); #87310 for 2Q4 library
(2q23-q24, 19 Mb); #87409 for 2Q5 library (2q21-q22, 23
Mb); #87410 for 2Q6 library (2q11-q14, 31 Mb); and
#87411 for 2CEN library (2p11.1-q11.1, 4 Mb). Details of
these libraries have been described: Hum. Genet. 93, 557,
1994 (for 2P1 library); Cytogenet. Cell Genet. 68, 17,
1995 (for 2P2 library); Somat. Cell Mol. Genet. 20, 353,
1994 (for 2P3 library); Somat. Cell Mol. Genet. 20, 133,
1994 (for 2P4 library); Genomics 14, 769, 1992 (for 2Q1
library); Somat. Cell Mol. Genet. 21, 335, 1995 (for 2Q2,
2Q3 & 2Q4 libraries); Somat. Cell Mol. Genet. 22, 57,
1996 (for 2Q5, 2Q6 & 2CEN libraries).

Region-specific libraries and short insert microclones for
chromosome 2 are particularly useful resources for its
eventual sequencing because this chromosome is less
exploited and detailed mapping information is lacking. We
have also constructed 3 region-specific libraries for the
entire chromosome 18 using similar methodologies,
including 18P library (18p11.32-p11.1, 22 Mb); 18Q1 library
(18q11.1-q12.3, 25 Mb); and 18Q2 library (18q21.1-q23,
34 Mb). Details of these libraries have been described
(Somat. Cell Mol. Genet. 22, 191-199, 1996).

DOE  Grant  No.  DE-FG03-94ER61819. 


*Identification and Mapping of
DNA-Binding Proteins Along Genomic
DNA by DNA-Protein Crosslinking

V.L. Karpov, O.V. Preobrazhenskaya, S.V. Belikov, and
D.E. Kamashev

In 1995-1996 we continued to map and identify nonhistone
proteins binding at loci along the yeast chromosome. Using
DNA-protein crosslinking in vivo, we detected two
polypeptides that probably correspond to core subunits of yeast
RNA-polymerase II in the coding region of the transketolase
gene (TKL2). Several nonhistone proteins were detected
that bind to the upstream region of TKL2 and to an
intergenic spacer between calmodulin (CMD1) and
mannosyl transferase (ALG1) genes. The apparent molecular
weight of these proteins was estimated. We also developed
a new method to synthesize strand-specific probes.

Using DNA-protein crosslinking in vitro, we found the
amino acid residues of the Lac-repressor that interact with
DNA. Only Lys-33 crosslinks with the Lac-operator in the
specific complex.

In addition to Lys-33, the N-terminal end of the protein
also crosslinks in a nonspecific complex. Our results
demonstrate that, in the presence of an inducer, the repressor's
N-termini crosslink to the operator's outermost
nucleotides. We suggest that binding of an inducer changes the
orientation of the DNA-binding domain of the Lac
repressor to the opposite of that found for the specific complex.

We plan to use a new method to increase resolution and
thus identify amino acids and nucleotides that participate
in DNA-protein recognition. The mechanisms of
transcription regulation of some yeast genes will thus be further
elucidated. Our approaches are based on DNA-protein
crosslinking. Detailed analysis will be done for specific
and nonspecific complexes, in the presence and absence of
inducers. This will allow us to make some conclusions
about possible conformational rearrangements in
DNA-protein complexes during gene activation at the
protein's DNA-binding domains.

DOE Grant No. OR00033-93CIS007.

While the complete sequencing of the human genome at
99.99% accuracy is an immediate goal of the Human
Genome Project, a serious technical deficiency remains the
ability to rapidly and efficiently construct sequence-ready
maps as sequencing templates. This is particularly
problematic in regions with unusual genome structure. An
understanding of these troublesome regions prior to
genome-wide sequencing will provide quality assurance as
well as reliable sequencing strategies in these regions.

This proposal will generate a "whole genome" data
resource to enable rapid and reliable sequencing of genomic
DNA by the definition and characterization of the more
than 52 regions of high homology now known to be
distributed within unrelated genomic regions and cloned in
BACs and PACs. To do this, we will:

1. Define regions of true homology in the human genome
by characterizing subsets of the 4,700 BAC/PACs that
generate multiple hybridization signals using fluorescence
in situ hybridization (FISH). Of the 1,200 sites of multiple
signals, more than 52 regions contain repeats as defined by
600 BAC/PACs. The chimerism rate, multiple clone wells,
and chromosome of origin will be defined by re-streaking
each clone, followed by fingerprint, FISH and PCR-based
end-sequence analyses on hybrid panels and radiation
hybrids.

Data will be shared with large sequencing efforts,
deposited in the 4D database, available with annotation on ftp
server and through GDB.

2. Generate contigs of BACs and PACs in regions of
complex genome organization. Using STS, EST analyses,
fingerprinting, BAC/PAC to BAC/PAC Southerns, end
sequence walking in 3.5-20X libraries, and metaphase/
interphase FISH, contigs will be seeded in 2-5 of the regions of
known genome complexity, each of which is estimated as
2-5 Mb. These data will be used to evaluate and provide
independent quality assurance of the STS, radiation
hybrid, and genetic maps in these regions. The most
significant of these include 1p36/1q; 2p/q; multiple sites;
8p23 and 8 further sites; 9p/q.

3. Define additional regions of complex genomic structure.
Library screening using known members of
multiple-member retro-transposon and other known repeated sequences
defined by the NCBI database, followed by FISH analyses
to determine structure and potential large regions of
associated homologies.

Collaboration  with  other  genome  and  sequencing  centers 
will  provide  quality  control  in  the  generation  of 
sequence-ready  maps  for  sequencing  templates. 

We believe that this effort is important since 1) it will
provide a critical mapping tool necessary for the generation of
sequence-ready maps; 2) if initiated now, the problem
areas could be delineated before scale-ups to full production
occur in major genome centers; 3) it represents a modest cost,
such that the cost of these data would comprise only a
small fraction of the cost of the entire genome sequence
and would vastly decrease the cost of sequencing errors; and
4) it could be completed in a short time (2 to 3 years) so as
to be of maximum benefit to sequencing centers. The
Principal Investigator in this project is ideally suited for this
effort because the group has developed the technology and
initiated FISH and genome analyses of over 4000 clones.


We  believe  that  this  project  represents  a  critical  and  timely 
effort  to  enable  rapid  and  cost  effective  human  genome 
sequencing. 

Subcontract under Glen Evans' DOE Grant No.
DE-FC03-96ER62294.

The human X chromosome is significant from both
medical and evolutionary perspectives. It is the location of
several hundred genes involved in human genetic disease, and
has maintained synteny among mammals: both of these
aspects are due to its role in sex determination and the
haploid nature of the chromosome in males. We have
addressed the mapping of this chromosome through a
number of efforts, ranging from long-range YAC-based
mapping to genomic sequence determination.

YAC mapping. The YAC-based map of the X is essentially
complete. We have constructed a 40 Mb physical map of
the Xp22.3-Xp21.3 region, spanning an interval from the
pseudoautosomal boundary (PABX) to the Duchenne
muscular dystrophy gene. This region is highly annotated, with
85 breakpoints defining 53 deletion intervals, 175 STSs
(20 of which are highly polymorphic), and 19 genes.

Cosmid binning. The YAC-based physical map is being used in
a systematic effort to identify and sort cosmids prepared at
LLNL from flow sorted X chromosomes into intervals.
Gene identification through use of a common database for
cDNA pool hybridization data is continuing. Over 50
YACs have been utilized as probes to the gridded cosmid
arrays. These have identified over 9000 cosmids from the
24,000 member library. An additional 4000 cosmids have
been identified using a variety of probes, with the bulk
coming from cDNA pool probes. More recent emphasis
has been placed on BAC clones as their identity for
sequencing has been established. These have been
identified using the usual methods.

Cosmid contig construction. Creation of long-range
continuity in cosmids and BACs proceeds from clones
identified by the YAC-based binning experiments. Identification
of STS carrying clones is carried out by a combined PCR/
hybridization protocol, and adds to the specificity of the
overlap data. Cosmids are grown and DNA is prepared by
an Autogen robot. DNAs are digested and analyzed by the
AB362 GeneScanner for collection of fingerprint data. The
use of novel fluorescent dyes (BODIPY) in this
application has increased signal strength markedly. End fragment
detection is currently carried out with traditional Southern
hybridization; however, additional dyes will permit
detection without hybridization in the GeneScanner protocol.
Data are transferred to a Sybase database and analyzed
with ODS (J. Arnold, U. Georgia) software for overlap.
ODS output is ported to GRAM (LANL) for map
construction. A fully automated approach has yet to be
achieved, but this goal is increasingly in reach.
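
To illustrate how restriction fingerprint data can indicate that two clones overlap, the following minimal sketch counts fragment sizes shared between two digests within a sizing tolerance. It is not the ODS algorithm, whose details are not given here; the function names, the 2% tolerance, and the fragment sizes are all hypothetical.

```python
def shared_bands(fp_a, fp_b, tol=0.02):
    """Count fragments of fp_a that match an unused fragment of fp_b.

    fp_a, fp_b -- fragment sizes in bp from one digest of each clone
    tol        -- relative sizing tolerance (2% is an arbitrary choice)
    """
    remaining = sorted(fp_b)
    matches = 0
    for size in sorted(fp_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tol * max(size, other):
                matches += 1
                del remaining[i]   # each band may be matched only once
                break
    return matches


def overlap_score(fp_a, fp_b, tol=0.02):
    """Shared bands as a fraction of the smaller fingerprint."""
    if not fp_a or not fp_b:
        return 0.0
    return shared_bands(fp_a, fp_b, tol) / min(len(fp_a), len(fp_b))


# Made-up fingerprints: cosmids 1 and 2 share four bands, 1 and 3 none.
cosmid_1 = [510, 820, 1430, 2250, 3900, 6100]
cosmid_2 = [505, 1440, 2250, 4700, 6090, 7800]
cosmid_3 = [300, 960, 1800, 2900, 5200, 8800]
print(overlap_score(cosmid_1, cosmid_2))
print(overlap_score(cosmid_1, cosmid_3))
```

A real pipeline would also have to weigh the chance of coincidental matches, which grows with the number of bands per clone and with the sizing tolerance.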

Sequencing. An independently funded project awarded to
RAG seeks to develop long-range genomic sequence for
~2 Mb of the human X chromosome. In support of this
project, cosmids have been constructed and isolated for the
1.6 Mb region between FRAXA and FRAXF in
Xq27.3-Xq28. To date, the complete sequences of the
regions surrounding the FMR1 and IDS genes have been
determined (180 and 130 kb, respectively), along with an
additional ~700 kb of the interval. This sequence has led to
identification of the gene involved in FRAXE mental
retardation. Additional sequence in Xq28 has been determined,
including that of a cosmid containing the two genes,
DXS1357E and a creatine transporter. This sequence has
been duplicated to chromosome 16p11 in recent
evolutionary history. Comparative sequence analysis reveals 94%
sequence identity over 25 kb, and the presence of
pentameric repeats which are likely to have mediated the
duplication event. A number of technical advances in
sequencing have been developed, including the use of
BODIPY dyes in AB373 sequencing protocols, which has
offered enhanced base calling due to reduced mobility
shifting, improved single strand template protocols for
much reduced cost, and streamlined informatics processes
for assembly and annotation.

DOE Grant Nos. DE-FG05-92ER61401 and
DE-FG03-94ER61830 and NIH Grant No. 5P30
HG00210.

Repetitive sequences occupy most of the
eukaryotic genome, but until the last few years there was
not much interest in their role. The situation changed
when alpha-satellites in human and minor satellites in
mouse became candidates for centromere function
responsibility. A number of centromere-specific proteins are
under investigation but none seems to distinguish
centromeric functions of exact sequences among long arrays of
tandemly repeated satellites. The proteins associated with
that array are poorly known. We are trying to find out what
proteins are involved in maintaining the heterochromatin
structure of different types of repetitive sequences.

The major proportion of total genomic satellite DNA
remains attached to the nuclear matrix (NM) after DNase I
and high salt treatment. We followed this association in
various steps during NM preparation by in situ
hybridization with the mouse satellite probe. Two mouse species
were used: M. musculus and M. spretus. Both contain the
same repertoire of satellite DNAs but in different amounts.
In M. musculus the centromeric heterochromatin contains
major satellite (MA) as the principal component. In M.
spretus the minor satellite (MI) is predominant. To test
DNA-binding activity of the proteins after
chromatography of the soluble NM proteins on cationic and anionic
ion-exchange columns, gel shift assays were performed
with a cloned dimer of MA and a trimer of MI. To produce
antibodies, the DNA-protein complexes obtained from
large-scale gel-shift assays were isolated and injected into
a guinea pig.

The gel shift assay with column fractions from M.
musculus NM and MA shows a ladder of complexes. The
complexes could be competed out with an excess of MA DNA
but not with the same amount of E. coli DNA. Antibodies
from the immune serum caused a hypershift of the MA/
NM protein complexes. Preimmune serum at the same
dilution did not alter the mobility of the complexes. A
combination of western and Southern blots allows us to
conclude that a protein with a molecular weight of about 80
kD and some similarity to the intermediate filaments is
responsible for the MA/NM interaction.

Specific DNA-binding activity to the MI has been tested
after column fractionation of the M. spretus NM extract. A
ladder of complexes can be competed out with an excess
of unlabeled MI but not E. coli or MA DNA. MI contains
the CENPB-box sequence, which is the binding site for the
protein CENPB, one of the centromeric proteins. Fractions
from the NM extract with MI-specific binding activity do
not contain CENPB, as shown by western blotting with
anti-CENPB antibodies.

The same kind of work is going on with human analogs of
MA and MI sequences, using large clones of satellite and
alpha-satellite DNA and nuclear matrices.

There are few satellite DNA-binding proteins isolated,
none of them directly from the NM. Our long-term aim is
to understand the role of these proteins in heterochromatin
formation and in heterochromatin association with NM.

Extracts from hand-isolated nuclear envelopes from frog
oocytes were tested for the specific DNA-binding activity
to (T2G4)116. A fragment of Tetrahymena telomere from a
YAC plasmid was used as a labelled probe in a gel-shift
assay. The DNA-protein complexes from the assay were
cut out and injected into a guinea pig. The antibodies (AB)
obtained stained one protein with an m.w. of about 70 kD
in the nuclear envelope of the oocyte, nothing in the inner
part of the oocyte, and 70 kD and 120 kD in the frog liver
nuclei. The immunofluorescent AB stained fine patches on
the oocyte nuclear envelope and a number of intranuclear
spots in the frog blood cells.

The  electron-microscope  immuno-gold  technique  showed 
that  the  protein  is  localized  in  the  outer  surface  of  the  oo- 
cyte nuclear  envelope  in  cup-like  structures.  DNA-binding 
activity  to  the  same  sequence  has  been  tested  and  found  in 
the mouse nuclear matrix extracts. The activity could be
eluted  from  the  DEAE52  ion  exchange  column  in  0.15 
NaCl.  The  activity  could  be  competed  out  with  the  frag- 
ment itself  but  not  with  E.  coli  DNA  in  the  same  amounts. 
AB  stained  a  70-kD  protein  in  active  fractions  after  ion 
exchange  chromatography.  In  nuclear  matrix  preparations, 
the  AB  recognized  a  120-kD  protein  as  well.  The  AB 
caused hypershift of the complexes in the gel-shift assay.
The AB has some affinity to the keratins. In the mouse cell
culture 3T3 line the staining is intranuclear, with fine dots
forming  chains  surrounding  dark  areas,  which  do  not  cor- 
respond to  the  nucleoli. 

Similar  results  were  observed  when  a  mouse  cell  line  was 
transformed with head- and tail-less human keratin constructs
(Bader et al., 1991, J Cell Biol 115: 1293). These
results  suggest  that  the  nuclear  proteins  detected  with  the 
AB  may  be  natural  analogs  of  this  artificial  keratin  con- 
struct. The  pattern  of  staining  did  not  resemble  the  picture 
of telomere-specific staining. Possibly the protein
recognizes the intragenomic (T2G4)2 sequence, which is present in
25% of murine GenBank sequences, rather than the telomere.
We  are  going  to  do  immunocytochemical  investigations  of 
frog  and  mouse  development  in  order  to  determine  the 
point when transcription of the 120-kD protein is initiated
and  the  staining  becomes  intranuclear. 

As a continuation of the previous project, the multiple
alignment of all the Alu sequences from GenBank is under
way. We are also trying to obtain antibodies to the main
Alu-binding proteins to find out how many proteins can
bind to an Alu sequence.

DOE  Grant  No.  OR00033-93C1S014. 
The POU domain of the Oct-2 transcription factor binds the octamer
sequence ATGCAAAT and a number of degenerate sequences.
It has been shown that the POUs and POUh domains
recognize the left and right parts of the oct-sequence, respectively.
The recognized sequences partly overlap in
the native octamer. In the degenerate recognition sites
these core sequences may be separated by a spacer of up to
four nucleotides. These data changed our view of
the number and structure of potential targets recognized on
DNA by POU proteins.

Protein-DNA binding is due to the interaction of
conserved amino acid residues with a DNA target. In
POU proteins the amino acid residues in positions 47 (Val), 50
(Cys) and 51 (Asn) of the POUh domain are absolutely conserved.
To examine a possible role of Val47 we
substituted this residue by each of the 19 other amino acid
residues and investigated the interaction of the mutant proteins
with the homeospecific site and its variants
(ATAANNN) and with the oct sequence. It was shown that the
Ile47 mutant retains the affinity and specificity. Replacement of
Val by Ser, Thr or His partially reduces the affinity.

Asi>47  mutant  sharply  relax  the  specificity  of  protein-DNA 
recognition. Mutations at position 47 have much stronger
effects on binding to homeospecific sites than to octamer
motifs. Our data indicate that there is no simple
one-letter code of protein/DNA recognition. It has been
shown that this recognition is determined not only by the
nature of the side chains involved in the contact but also by
the structure of the DNA-binding domain as a whole and probably
by cooperative interaction of the POUs and POUh domains.

Proposals for 1997. The role of Cys50 in POU domain/
DNA recognition will be investigated. This residue is absolutely
conserved in POU proteins but is variable in
related homeo-proteins. Our preliminary data suggest
that the residue at position 50 of the POU homeodomain
has a key role in discrimination between TAAT-like and
octamer sequences. The role of the nucleotides flanking
the DNA target will also be investigated.

DOE  Grant  No.  OR00033-93CIS005. 

Relevant  Publications 

1. Stepchenko A.G. (1994) Noncanonical oct-sequences are targets for
mouse Oct-2B transcription factor. FEBS Letters, V.337, P.175-178.

2. Stepchenko A.G., Polanovsky O.L. (1996) Interaction of Oct proteins
with DNA. Molecular Biology, V.30, P.296-302.

3. Stepchenko A.G., Luchina N.N., Polanovsky O.L. The role of
conservative Val47 for POU homeodomain/DNA recognition. FEBS
Letters, in press.


DOE Human Genome Program Report, Part 2, 1996 Research Abstracts


Mapping 

*Development  of  Intracellular  Flow 
Karyotype  Analysis 

V.V. Zenin,1 N.D. Aksenov,1 A.N. Shatrova,1 N.V. Klopov,2
L.S. Cram,3 and A.I. Poletaev
Engelhardt Institute of Molecular Biology; Russian
Academy of Sciences; Moscow 117984, Russia
Poletaev: +7-095/135-9824, Fax: -1405
polet@polet.msk.su
1Institute of Cytology; Russian Academy of Sciences;
St. Petersburg, Russia
2St. Petersburg Institute of Nuclear Physics; Gatchina, Russia
3Los Alamos National Laboratory; Los Alamos, NM 87545


Instrumentation  for  univariate  fluorescent  flow  analysis  of 
chromosome  sets  has  been  developed  for  human  cells.  A 
new  method  of  cell  preparation  and  intracellular  staining 
of  chromosome  with  different  dyes  was  developed  and 
improved. The cell suspension for flow analysis must satisfy
the following requirements: a minimal amount of free chromosomes
and debris (dead cells, cell fragments, etc.); chromosome
structure must be stabilized inside mitotic cells;
chromosomes must be stained inside the cells to saturation
with the dyes used; and chromosomes must be released
from cells with the minimal possible mechanical treatment.
The method includes enzyme treatment (chymotrypsin),
incubation with saponin, and separation of
prestained cells from debris on a sucrose gradient. The developed
protocol was tested and improved in the course of
several months of work and allows us to obtain a
well-stained sample with a minimal amount of contaminants [2].

A  special  magnetic  mixing/stirring  device  was  constructed 
to  perform  cell  membrane  breaking.  It  was  placed  inside 
the  flow  chamber  of  a  serial  flow  cytometer  ATC-3000 
equipped  with  additional  electronic  card  for  time-gated 
data  acquisition  [1].  The  rupturing  of  prestained  mitotic 
cells is performed by means of a small magnetic rod vibrating
in an alternating magnetic field. The efficiency of
mitotic cell breaking with the electromagnetic cell-breaking
device was tested using different human cell lines [2,3].

The  device  works  in  a  stepwise  mode:  a  defined  volume  of 
sample  is  delivered  to  the  breaking  chamber  for  rupturing 
mitotic  cell  (cells)  for  a  defined  time  period,  followed  by 
buffer  wash  to  move  the  released  chromosomes  from  the 
breaking  chamber  to  the  point  of  the  analysis.  The  infor- 
mation about  the  chromosomes  appearing  at  the  point  of 
analysis  is  accumulated  in  list  mode  files,  making  it  pos- 
sible to  resolve  chromosome  sets  arising  from  single  cells 
on  the  basis  of  time  gating.  The  concentration  of  cells  in 
the  sample  must  be  kept  low  to  ensure  that  only  one  cell  at 
a  time  enters  the  breaking  device. 
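
The time-gating idea described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each list-mode event is recorded as an (arrival time, fluorescence) pair and that events from one ruptured cell arrive in a burst separated from the next burst by a quiet interval longer than `max_gap`. The function name, the event layout, and the gap value are illustrative assumptions, not the authors' software.

```python
from typing import List, Tuple

# Hypothetical event layout: (arrival time in seconds, fluorescence signal).
Event = Tuple[float, float]


def group_by_time_gate(events: List[Event], max_gap: float) -> List[List[Event]]:
    """Split list-mode events into bursts separated by more than max_gap seconds.

    Each burst is taken to be the chromosome set released from one cell.
    """
    sets: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Extend the current burst if the event follows the previous one closely.
        if sets and event[0] - sets[-1][-1][0] <= max_gap:
            sets[-1].append(event)
        else:
            sets.append([event])
    return sets


# Toy data: two bursts of events separated by a long quiet interval.
events = [(0.10, 5.2), (0.12, 3.1), (0.15, 7.8), (2.40, 5.0), (2.43, 3.3)]
chromosome_sets = group_by_time_gate(events, max_gap=0.5)
print([len(s) for s in chromosome_sets])
```

The sketch also shows why the cell concentration must stay low: if two cells rupture within `max_gap` of each other, their chromosomes merge into one burst and can no longer be separated by timing alone.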

The  developed  software  classifies  chromosome  sets  ac- 
cording to  different  criteria:  total  number  of  chromosomes, 
overall DNA content in the set, and the number of chromosomes
of a certain type [2,3]. In addition, it is possible to determine
the presence of extra chromosomes or loss of
chromosome  types.  Thus  this  approach  combines  the  high 
performance  of  flow  cytometry  (quantitation  and  high 
throughput)  with  the  advantages  of  image  analysis  (cell  to 
cell  karyotype  analysis  and  skills  of  trained  cytogeneti- 
cist).  The  data  analysis  capabilities  offer  extensive  flexibil- 
ity in  determining  important  features  of  the  karyotypes 
under  study.  This  development  offers  the  potential  to  du- 
plicate most  of  what  is  determined  by  clinical  cytogeneti- 
cists.  The  results  now  obtained  are  in  good  accordance 
with  goals  of  the  project  formulated  before  [4]. 
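
The classification criteria listed above (total chromosome number, overall DNA content, and the count of each chromosome type) can be sketched as follows. This is an illustration only: it assumes that every event has already been assigned a chromosome type, for example from its fluorescence peak, and it uses a toy three-type karyotype. The names `summarize_set` and `expected` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_set(events: List[Tuple[str, float]], expected: Dict[str, int]) -> dict:
    """Summarize one chromosome set.

    events: (chromosome type, relative DNA content) for each chromosome.
    expected: normal copy number for each chromosome type.
    """
    counts = Counter(t for t, _ in events)
    all_types = set(expected) | set(counts)
    # Types seen more often than expected (extra chromosomes).
    extra = {t: counts[t] - expected.get(t, 0)
             for t in all_types if counts[t] > expected.get(t, 0)}
    # Types seen less often than expected (lost chromosomes).
    missing = {t: expected[t] - counts[t]
               for t in expected if counts[t] < expected[t]}
    return {
        "n_chromosomes": len(events),
        "total_dna": sum(d for _, d in events),
        "extra": extra,
        "missing": missing,
    }


# Toy karyotype with three chromosome types, two copies each.
expected = {"A": 2, "B": 2, "C": 2}
one_cell = [("A", 3.0), ("A", 3.0), ("A", 3.0), ("B", 2.0), ("B", 2.0), ("C", 1.0)]
summary = summarize_set(one_cell, expected)
print(summary)
```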
BACs and fosmids are stable, nonchimeric, and highly
representative cloning systems. BACs maintain
large-fragment genomic inserts (100 to 300 kb) that are
easily prepared for most types of experiments, including
DNA sequencing.

We  have  improved  the  methods  for  generating  BACs  and 
developed  extensive  BAC  libraries.  We  have  constructed 
human  BAC  libraries  with  more  than  175,000  clones  from 
male  fibroblast  and  sperm,  and  a  mouse  BAC  library  with 
more than 200,000 clones. We are currently expanding the human
library, using sperm samples from anonymous donors, with the
aim of achieving a total of 50X coverage of the human genome.
The BAC libraries provide resources to bridge the gap between
genetic-cytogenetic information and detailed physical
characteristics of genomic regions that include DNA
sequence  information.  They  also  provide  reliable  tools  for 
generating  a  high-resolution,  integrated  map  on  which  a 
variety  of  information  and  resources  are  correlated.  Using 
primarily  the  human  BAC  library  constructed  from  fibro- 
blasts, we  have  assembled  a  physical  contig  map  of  chro- 
mosome 22  [1].  First,  the  entire  library  was  screened  by 
most  of  the  known  chromosome  22-specific  markers  that 
include  cDNA,  anonymous  STS  markers,  FISH-mapped 
cosmids  and  fosmids,  YAC-Alu  PCR  products, 
FISH-mapped  BACs,  and  flow-sorted  chromosome  22 
DNA.  The  positive  clones  have  been  assembled  into 
contigs  by  means  of  the  STS-contents  or  other  markers 
assigned  to  BAC  clones.  Most  of  the  contigs  were  con- 
firmed by  using  a  restriction  fingerprinting  scheme  origi- 
nally developed  by  Sulston  and  Coulson,  and  modified  in 
our  laboratory.  Currently,  the  contigs  cover  over  80%  of 
the  chromosome  arm.  Various  physical  or  genetic  land- 
marks on  this  chromosome  can  now  be  precisely  localized 
simply  by  assigning  them  to  BACs  or  contigs  on  the  map. 
Using  BAC  end  sequence  information  from  each  of  the 
chromosome  22-specific  BACs,  it  is  now  possible  to  close 
the  gaps  efficiently  by  screening  deeper  BAC  libraries 
with  new  probes  specific  to  the  ends  of  contigs. 
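
As a rough illustration of assembling clones into contigs by STS content, the sketch below groups BACs that share at least one marker, directly or through a chain of other clones. It is a simplified stand-in for the procedure described above: it only groups clones, does not order them within a contig, and does not model the fingerprint confirmation step. The clone and marker names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def assemble_contigs(sts_content: Dict[str, Set[str]]) -> List[List[str]]:
    """Group clones into contigs: clones sharing any STS marker are linked."""
    parent = {clone: clone for clone in sts_content}

    def find(clone: str) -> str:
        # Union-find lookup with path halving.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    # Index clones by the markers they contain.
    clones_by_marker = defaultdict(list)
    for clone, markers in sts_content.items():
        for marker in markers:
            clones_by_marker[marker].append(clone)

    # Clones positive for the same marker belong to the same contig.
    for clones in clones_by_marker.values():
        for other in clones[1:]:
            parent[find(clones[0])] = find(other)

    groups = defaultdict(list)
    for clone in sts_content:
        groups[find(clone)].append(clone)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical screening results: clone -> STS markers it tested positive for.
screening = {
    "bac1": {"sts1", "sts2"},
    "bac2": {"sts2", "sts3"},
    "bac3": {"sts3"},
    "bac4": {"sts9"},
    "bac5": {"sts9", "sts10"},
}
contigs = assemble_contigs(screening)
print(contigs)
```

In this toy data the two contigs share no marker, so the gap between them could only be closed by finding a clone that is positive for markers from both, which is the role the text assigns to probes derived from contig ends.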

The  resulting  BAC  contig  map  is  now  serving  as  a  road 
map  for  sequencing  the  chromosome.  Chromosome 
22-specific BAC clones have been distributed to our collaborators,
including The Sanger Center and Dr. Bruce Roe
at the University of Oklahoma, and many of the clones have
already been sequenced. The BAC end sequencing scheme [2]
will play a crucial role in the complete sequencing of
chromosome 22, and we are currently sequencing the ends
of these BACs directly using the miniprepped BAC DNA
as templates.
BAC clones are ideal for genome analysis since they are
non-chimeric, stably maintain large-fragment genomic inserts
(100-300 kb) [1], and it is easy to prepare BAC DNA
samples for most types of experiments, including DNA sequencing [2].
We have improved the BAC cloning technique in
the past years and constructed >20X human BAC libraries.
As BACs are proving to be the most efficient reagents for
large-scale genomic sequencing, we intend to increase the
depth of the library to 50X genomic equivalence. Using
the ESTs, especially the Unigenes that have been chromosomally
assigned by other means such as Radiation Hybrid
mapping  and  YAC-based  STS  content  mapping,  we  plan  to 
organize  the  BAC  library  into  a  mapped  resource.  The  re- 
sulting BAC-EST  framework  map  will  provide  a  high 
resolution  EST  (or  gene)  map  and  instant  entry  points  for 
gene  finding  and  large  scale  genomic  sequencing.  We  also 
intend  to  determine  the  end  sequences  of  the  BAC  inserts 
from a significant number of the clones (at least 350,000
clones  or  15X  genomic  equivalence)  within  two  years  [3]. 
All  the  BAC-EST  mapping  data  and  BAC  end  sequences 
will be made available via public databases and Web
servers.  The  mapping  data  and  end  sequence  information 
will  dramatically  facilitate  the  process  of  finding  clones 
that  extend  the  sequenced  regions  with  minimal  overlaps. 
Thus,  the  tagged  BAC  libraries  will  serve  as  a  reliable  and 
facile  sequence-ready  resource  and  an  organizing  tool  to 
support  and  coordinate  simultaneously  multiple  sequenc- 
ing projects  all  over  the  genome. 
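
The figures quoted above can be checked with a short back-of-the-envelope calculation. The sketch assumes a haploid human genome of about 3 billion base pairs and two end sequences per clone; neither number is stated in the abstract, so both are labeled assumptions in the code. The 350,000 clones and 15X equivalence come from the text.

```python
GENOME_BP = 3.0e9        # assumed haploid human genome size (not from the text)
CLONES = 350_000         # from the text: at least 350,000 clones
TARGET_COVERAGE = 15     # from the text: 15X genomic equivalence
ENDS_PER_CLONE = 2       # assumption: both insert ends are sequenced


def implied_insert_size(coverage: float, genome_bp: float, clones: int) -> float:
    """Average insert size (bp) needed for `clones` clones to give `coverage`."""
    return coverage * genome_bp / clones


def end_tag_spacing(genome_bp: float, clones: int, ends_per_clone: int) -> float:
    """Average distance (bp) between end-sequence tags along the genome."""
    return genome_bp / (clones * ends_per_clone)


insert_bp = implied_insert_size(TARGET_COVERAGE, GENOME_BP, CLONES)
spacing_bp = end_tag_spacing(GENOME_BP, CLONES, ENDS_PER_CLONE)
print(f"implied average insert: {insert_bp / 1000:.0f} kb")
print(f"average spacing between end tags: {spacing_bp / 1000:.1f} kb")
```

Under these assumptions the implied average insert falls inside the 100-300 kb range given for BACs, and an end-sequence tag would occur on average every few kilobases, which is what makes it practical to pick a clone that extends a sequenced region with minimal overlap.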
Large-scale single-pass sequencing of cDNA clones randomly
picked from libraries has proven quite powerful for
identifying genes, and the use of normalized libraries in which
the frequency of all cDNAs is within a narrow range has
been shown to expedite the process by minimizing the redundant
identification of the most prevalent mRNAs. In an
attempt  to  contribute  to  the  ongoing  gene  discovery  ef- 
forts, we  have  further  optimized  our  original  procedure  for 
construction of normalized directionally cloned cDNA libraries [1]
and we have successfully applied it to generate a
number  of  human  cDNA  libraries  from  a  variety  of  adult 
and  fetal  tissues  [2].  To  date  we  have  constructed  libraries 
from  infant  brain,  fetal  brain,  adult  brain,  fetal 
liver-spleen,  full-term  and  8-9  week  placentae,  adult 
breast,  retina,  ovary  tumor,  melanocytes,  parathyroid  tu- 
mor, senescent  fibroblasts,  pineal  glands,  multiple  sclero- 
sis plaques,  testis,  B  cells,  fetal  heart,  fetal  lung,  8-9  week 
fetuses  and  pregnant  uterus.  Several  additional  libraries  are 
currently  in  preparation.  All  libraries  have  been  contrib- 
uted to  the  IMAGE  consortium,  and  they  are  being  widely 
used  for  sequencing  and  mapping. 

However,  given  the  large  scale  nature  of  the  ongoing  se- 
quencing efforts  and  the  fact  that  a  significant  fraction  of 
the  human  genes  has  been  identified  already,  the  discovery 
of  novel  cDNAs  is  becoming  increasingly  more  challeng- 
ing. In  an  effort  to  expedite  this  process  further,  in  collabo- 
ration with  Greg  Lennon  (LLNL)  we  have  developed  and 
applied  subtractive  hybridization  strategies  to  eliminate 
pools of sequenced cDNAs from libraries yet to be surveyed.
Briefly, single-stranded DNA obtained from pools
of arrayed and sequenced I.M.A.G.E. clones is used as
templates  for  PCR  amplification  of  cDNA  inserts  with 
flanking  T7  and  T3  primers.  PCR  amplification  products 
are  then  used  as  drivers  in  hybridizations  with  normalized 
libraries  in  the  form  of  single-stranded  circles.  The  remain- 
ing single-stranded  circles  (subtracted  library)  are  purified 
by  hydroxyapatite  chromatography,  converted  to 
double-stranded circles and electroporated into bacteria.
Preliminary  characterization  of  a  subtracted  fetal 
liver-spleen library indicates that the procedure is effective
in enhancing the representation of novel cDNAs.

In  an  effort  to  enhance  the  representation  of  full-length 
cDNAs  in  our  libraries,  as  we  strive  towards  our  final  ob- 
jective of  generating  full-length  normalized  cDNA  librar- 
ies, we  have  adapted  our  normalization  protocol  to  take 
advantage  of  the  fact  that  it  is  now  possible  to  produce 
single-stranded  circles  in  vitro  by  sequentially  digesting 
supercoiled  plasmids  with  Gene  II  protein  and  Exonu- 
clease  III  (Life  Technologies).  This  has  proven  significant 
because  it  circumvents  the  biases  introduced  by  differen- 
tial growth  of  clones  containing  small  and  large  cDNA  in- 
serts when  single-strands  are  produced  in  vivo  upon  super- 
infection with  a  helper  phage. 
Numerous studies have confirmed the notion that mouse
and  human  chromosomes  resemble  each  other  closely 
within  blocks  of  syntenic  homology  that  vary  widely  in 
size,  containing  from  just  a  few  to  several  hundred  related 
genes.  Within  the  best-mapped  of  these  homologous  re- 
gions, the  presence  and  location  of  specific  genes  can  be 
accurately predicted in one species, based upon the mapping
results obtained in the other. In addition, information
regarding  gene  function  derived  from  the  analysis  of  hu- 
man hereditary  traits  or  mapped  murine  mutations,  can 
also  be  extrapolated  from  one  species  to  another.  However, 
syntenic  relationships  are  still  not  established  for  many 
human regions, and local rearrangements including apparent
deletions, inversions, insertions, and transposition
events,  complicate  most  of  the  syntenically  homologous 
regions  that  appear  simple  on  the  gross  genetic  level.  Be- 
cause of  these  complications,  the  power  of  prediction  af- 
forded in  any  homology  region  increases  tremendously 
with the level of resolution and degree of internal consistency
associated with a particular set of comparative mapping
data. Our groups have focused on further defining
the borders of syntenic linkage groups in human and
mouse, on elucidating mechanisms behind evolutionary
rearrangements that distinguish chromosomes of mammalian
species, and on devising means of exploiting the
relationships  between  the  two  genomes  for  the  discovery 
and  analysis  of  new  genes  and  other  functional  units  in 
mouse  and  man. 

One  of  the  larger  contiguous  blocks  of  mouse-human  ge- 
nomic homology  includes  the  proximal  portion  of  mouse 
chromosome 7 (Mmu7). Detailed analysis of this large region
of mouse-human homology has served as the initial
focus of these collaborative studies. Our results have
shown that gene content, order and spacing are remarkably
well-conserved throughout the length of this approximately
23 cM/29 Mb region of mouse-human homology,
except for six internal rearrangements of gene sequence in
mouse relative to man. One of these differences involves a
small segment of H19q13.4 genes whose murine counterparts
have been transposed out of the large Mmu7/H19q
conserved synteny region into a separate linkage group
located on mouse chromosome 17. The six internal rearrangements,
including two transpositions and four local
inversions, are clustered together at two sites; our data
suggest that the rearrangements occurred in a coincident
fashion, or were commonly associated with unstable DNA
sequences at those sites. Interestingly, both rearranged regions
are occupied by large tandemly clustered gene families,
suggesting that these locally repeated sequences may
have contributed to their evolutionary instability. The
structure and conserved functions of genes within these
and other clustered gene families located on H19q also represent
an active line of interest to our group. More recently,
we have extended mapping studies to include clustered
gene families located in other chromosomal regions,
and are working to define the borders of mouse-human
syntenic segments on a broader, genome-wide scale.

DOE Contract No. DE-AC05-96OR22464 and Contract
No. W-7405-ENG-48 with Lawrence Livermore National
Laboratory.

Positional Cloning of Murine Genes

Lisa Stubbs, Cymbeline Culiat, Ethan Carver, Johannah
Doyle, Laura Chittenden, Mitchell Walkowicz, Nestor
Cacheiro, Greg Lennon,1 Gary Wright,2 Joe Rutledge,3
Robert Nicholls,4 and Walderico Generoso
Biology Division; Oak Ridge National Laboratory; Oak
Ridge, TN 37831-8077

423/574-0854.  Fav:  -12SX  stubhsl@hioaxl.bio.omLgov  or 

stuht'sljt^  omLgov 

1Human Genome Center, Lawrence Livermore National
Laboratory; Livermore, CA 94550
2University of Texas Southwestern Medical Center at
Dallas; Dallas, TX
3Children's Hospital and Medical Center, University of
Washington School of Medicine; Seattle, WA 98105
4Department of Genetics; Case Western Reserve University;
Cleveland, Ohio

Chromosome rearrangements, notably deletions and translocations,
have proved invaluable as tools in the mapping
and molecular cloning of acquired and inherited human
diseases. Because balanced translocations are cytologically
visible, and generally produce profound disturbances in
both gene expression and DNA structure without necessarily
disturbing the structure of multiple genes, this type of
mutation provides an especially valuable "tag" that greatly
simplifies mapping, cloning, and assessment of candidate
genes associated with a disease. Although balanced translocations
are relatively rare in human populations, they are
readily induced in the mouse. Using various mutagenesis
protocols, we have generated numerous translocation-bearing
mutant mouse strains that display an impressive variety
of health-related anomalies, including obesity, polycystic
kidneys, gastrointestinal disorders, limb and skeletal
deformities, neural tube defects, ataxias, tremors, hereditary
deafness and blindness, reproductive dysfunction, and
complex behavioral defects. The ability to map the genes
associated with translocation breakpoints cytogenetically,
first crudely through straightforward banding techniques
and then to a higher level of resolution using fluorescence
in situ hybridization methods, allows us to avoid the costly
and time-consuming crosses that are required for the mapping
of most mutant genes. With this rapidly obtained,
crude-level mapping information available, we can readily
assess possible relationships between newly arising mutant
phenotypes and linked candidate genes or related diseases
that map to homologous regions of the human genome.
Using this approach, we have recently begun to define the
map positions of several mutations. Mapping results have
led us to the identification of candidate genes for two mutations:
one associated with congenital deafness and predisposition
to severe gastric ulcers, and another associated
with late-onset obesity. So far, we have characterized in
detail only a fraction of the mouse strains that comprise this
valuable, recently generated mutant collection. As an integral
part of this program, we are actively exploring new
strategies and integrating information, technology and resources
derived from the Human Genome research effort
that promise to increase the efficiency of breakpoint mapping
and cloning dramatically. The mutations are scattered
widely throughout the mouse genome, corresponding to a
broad selection of human homology regions. As new
breakpoints are mapped, and large numbers of newly sequenced
cDNA clones are assigned to the mouse and human
maps, the potential for rapid association between
cloned gene and mapped mutation will increase dramatically.
This large collection of murine translocation mutants
therefore represents a powerful resource for linking
mapped cDNA clones to health-related phenotypes
throughout the genome.

In addition to the analysis of translocation mutants, we
have also characterized other types of mouse mutations,
including: (1) tottering and leaner, allelic mutations associated
with ataxia and epilepsy in mice, and representing
murine models for the human diseases familial hemiplegic
migraine and episodic ataxia, respectively; and (2) jdf2, a
locus associated with mutations causing runting, neuromuscular
tremors and male sterility, which is located in a
mouse region related to the Prader-Willi/Angelman syndrome
gene interval of human 15q11-q13. Both sets of
mutations affect large, complex, and highly conserved
genes, and provide important animal models for the exploration
of the diverse roles their human counterparts may
play in human disease. In concert with these gene cloning
studies, we have been involved in exploring new means of
exploiting mouse-human genomic conservation in the isolation
of functionally significant sequences from large
cloned regions of human DNA. The methods we have developed
hold great promise as an efficient tool for gene
discovery in cloned genomic regions.

DOE Contract No. DE-AC05-96OR22464.
Human Artificial Episomal
Chromosomes (HAECs) for Building
Large Genomic Libraries

Min Wang, Panayotis A. Ioannou,2 Michael Grosz, Subrata
Banerjee, Evy Bashiardes,2 Michelle Rider, Tian-Qiang
Sun,1 and Jean-Michel H. Vos1

Lineberger Comprehensive Cancer Center and 1Department
of Biochemistry and Biophysics; University of North
Carolina; Chapel Hill, NC 27599
Vos: 919/966-3036, Fax: -3015, vos@med.unc.edu
2The Cyprus Institute of Neurology and Genetics; Nicosia,
Cyprus

Of  some  100,000  human  genes,  only  a  few  thousand  have 
been  cloned,  mapped  or  sequenced  so  far.  Much  less  is 
known  about  other  chromosomal  regions  such  as  those 
involved  in  DNA  replication,  chromatin  packaging,  and 
chromosome  segregation.  Construction  of  detailed  physi- 
cal maps  is  only  the  first  step  in  localizing,  identifying  and 
determining  the  function  of  genetic  units  in  human  cells. 
Studying  human  gene  function  and  regulation  of  other 
critical genomic regions that span hundreds of kilobase
pairs  of  DNA  requires  the  ability  to  clone  an  entire  func- 
tional unit  as  a  single  DNA  fragment  and  transfer  it  stably 
into  human  cells. 

We have developed a human artificial episomal chromosome
(HAEC) system based on the latent replication origin of
the large herpes Epstein-Barr virus (EBV) for the propagation
and stable maintenance of DNA as circular
minichromosomes in human cells [1,2]. Individual HAECs
carried human genomic inserts ranging from 60 to 330 kb
and appeared genetically stable. An HAEC library of 1500
independent clones carrying random human genomic fragments
with average sizes of 150 to 200 kb was established
and allowed recovery of the HAEC DNA. This autologous
HAEC system with human DNA segments directly cloned
in human cells provides an important tool for functional
study of large mammalian DNA regions and gene
therapy [3,4].

Current efforts are focused on (a) shuttling large BAC/PAC
genomic inserts in human and rodent cells, (b)
packaging BAC/PAC/HAEC clones as large infectious
herpes viruses for shuttling genomic inserts between
mammalian cells, and (c) constructing bacterial-based human
and rodent HAEC libraries. (a) We have designed a
"pop-in" vector, which can be inserted into current
BAC- or PAC-based clones via site-specific integration.
This "CRE-LOXP"-mediated system has been used to establish
BAC/PAC up to 250 kb in size in human cells as
HAECs. (b) We have obtained packaging of 160-180 kb
exogenous DNA into infectious virions using the human
lymphotropic Epstein-Barr virus. After delivery into human
beta-lymphoblast cells the HAEC DNA was stably
established as 160-180 kb functional autonomously replicating
episomes [5,7]. We have also generated a hybrid
BAC/HAEC vector, which can shuttle large DNA inserts,
i.e., at least up to 260 kb, between bacteria and human
cells. Such a system is being used to develop large-insert
libraries, whose clones can be directly transferred into human
or rodent cells for functional analysis. These
HAEC-derived systems will provide useful molecular
tools to study large genetic units in humans and rodents,
and complement the functional interpretation of current
sequencing efforts.

DOE Contract No. DE-FG05-91ER61135.
*Cosmid and cDNA Map of a Human
Chromosome 13q14 Region Frequently
Lost at B Cell Chronic Lymphocytic
Leukemia

N.K. Yankovsky, B.I. Kapanadze, A.B. Semov,
A.V. Baranova, and G.E. Sulimova
N.I. Vavilov Institute of General Genetics; Moscow
117809, Russia

■H7-095/I35-5363,  Fax:  -\2%9. yankov.iky@vigg.ru  and 

hion@gla.t.apc.org  (send  to  both  addresses) 

We are mapping a human chromosome 13q14 region frequently
lost in the human blood malignancy called B-cell
chronic lymphocytic leukemia (BCLL). The final goal of
the project is to find a putative oncosuppressor gene lost in
the region in BCLL. We have constructed a cosmid contig
between the D13S1168 and D13S25 loci in the region. The
interval had been shown to be in the center of the
BCLL-associated deletions. The contig consists of more than 100
cosmids from the LANL human chromosome 13-specific
library (LA13NC01). We estimated the distance between the
D13S1168 and D13S25 loci as about 540 kb. We are constructing
a transcriptional map of the region. Seven different
cDNA clones were found with two of the cosmid
clones. All cosmids corresponding to the minimal tiling
path between D13S1168 and D13S25 are being used as
probes for screening new cDNA clones. I.M.A.G.E. Consortium
(LLNL) cDNA clones assigned to 13q14 will be
mapped against the cosmid contig. Mapped cDNA clones
will be checked as candidate oncosuppressor genes for
BCLL.
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BCM  Server  Core 

Daniel  Davison  and  Randall  Smith 

Baylor  College  of  Medicine;  Houston,  TX  77030 

713/798-3738, Fax: -3759, davison@bcm.tmc.edu

http://www.bcm.tmc.edu 

We  are  providing  a  variety  of  molecular  biology-related 
search  and  analysis  services  to  Genome  Program  investi- 
gators to  improve  the  identification  of  new  genes  and  their 
functions.  These  services  are  available  via  the  BCM 
Search  Launcher  World  Wide  Web  (WWW)  pages  which 
are  organized  by  function  and  provide  a  single 
point-of-entry  for  related  searches.  Pages  are  included  for 
1) protein sequence searches, 2) nucleic acid sequence
searches,  3)  multiple  sequence  alignments,  4)  pairwise  se- 
quence alignments,  5)  gene  feature  searches,  6)  sequence 
utilities,  and  7)  protein  secondary  structure  prediction.  The 
Protein  Sequence  Search  Page,  for  example,  provides  a 
single  form  for  submitting  sequences  to  WWW  servers 
that  provide  remote  access  to  a  variety  of  different  protein 
sequence  search  tools,  including  BLAST,  FASTA, 
Smith-Waterman,  BEAUTY,  BLASTPAT,  FASTAPAT, 
PROSITE,  and  BLOCKS  searches.  The  BCM  Search 
Launcher  extends  the  functionality  of  other  WWW  ser- 
vices by  adding  additional  hypertext  links  to  results  re- 
turned by  remote  servers.  For  example,  links  to  the  NCBI's 
Entrez  database  and  to  the  Sequence  Retrieval  System 
(SRS)  are  added  to  search  results  returned  by  the  NCBI's 
WWW  BLAST  server.  These  links  provide  easy  access  to 
Medline  abstracts,  links  to  related  sequences,  and  addi- 
tional information  which  can  be  extremely  helpful  when 
analyzing  database  search  results.  For  novice  or  infrequent 
users  of  sequence  database  search  tools,  we  have  pre-set 
the  parameter  values  to  provide  the  most  informative 
first-pass sequence analysis possible.
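
The link-adding step described above can be pictured as a small
text filter over a returned report. The following is a minimal
sketch only, not the Search Launcher's implementation: the
identifier pattern (`gi|` followed by digits) and the URL
template are assumptions introduced for illustration.

```python
import html
import re

# Hypothetical URL template; the actual Entrez and SRS link formats
# are not given in the abstract.
LINK_TEMPLATE = "http://example.org/entrez?uid={uid}"

# Assumed identifier format: "gi|" followed by digits.
GI_PATTERN = re.compile(r"gi\|(\d+)")


def add_links(report_text):
    """Escape a plain-text search report and hyperlink each gi identifier."""
    escaped = html.escape(report_text)

    def linkify(match):
        url = LINK_TEMPLATE.format(uid=match.group(1))
        return '<a href="{}">{}</a>'.format(url, match.group(0))

    # Wrap in <pre> so the column layout of the report is preserved.
    return "<pre>" + GI_PATTERN.sub(linkify, escaped) + "</pre>"
```

Escaping the report before inserting anchors keeps any angle
brackets in the original output from being read as markup.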

A  batch  client  interface  to  the  BCM  Search  Launcher  for 
Unix  and  Macintosh  computers  has  also  been  developed  to 
allow  multiple  input  sequences  to  be  automatically 
searched  as  a  background  task,  with  the  results  returned  as 
individual  HTML  documents  directly  on  the  user's  system. 
The  BCM  Search  Launcher  as  well  as  the  batch  client  are 
available on the WWW at URL
http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html.

The  BCM/UH  Server  Core  provides  the  necessary  compu- 
tational resources  and  continuing  support  infrastructure  for 
the  BCM  Search  Launcher.  The  BCM/UH  Server  Core  is 
composed  of  three  network  servers  and  currently  supports 
electronic  mail  and  WWW-based  access;  ultimately,  spe- 
cialized client-server  access  will  also  be  provided.  The 
hardware used includes a 2048-processor MasPar massively
parallel MIMD computer, a DEC Alpha AXP/OSF1,
a  Sun  2-processor  SparcCenter  1000  server,  and  several 
Sun  Sparc  workstations. 


In  addition  to  grouping  services  available  elsewhere  on  the 
WWW  and  providing  access  to  services  developed  at 
BCM  and  UH,  the  BCM/UH  Server  Core  will  also  provide 
access  to  services  from  developers  who  are  unwilling  or 
unable  to  provide  their  own  Internet  network  servers. 

Grant Nos.: DOE, DE-FG03-95ER62097/A000; National
Library of Medicine, R01-LM05792; National Science
Foundation, BIR 91-11695; National Research Service
Award, F32-HG00133-01; NIH, P30-HG00210 and
R01-HG00973-01.

A  Freely  Sharable 
Database-Management  System 
Designed  for  Use  in  Component-Based, 
Modular Genome Informatics Systems†

Steve Rozen,¹ Lincoln Stein,¹ and Nathan Goodman

The Jackson Laboratory; Bar Harbor, ME 04609

Goodman: 207/288-6158, Fax: -6078, nat@jax.org

¹Whitehead Institute for Biomedical Research; Cambridge,

MA 02139

http://goodman.jax.org

http://www.genome.wi.mit.edu/informatics/workflow

We  are  constructing  a  data-management  component,  built 
on  top  of  commercial  data-management  products,  tuned  to 
the  requirements  of  genome  applications.  The  core  of  this 
genome  data  manager  is  designed  to: 

•  support  the  semantic  and  object-oriented  data  models 
that  have  been  widely  embraced  for  representing  ge- 
nome data, 

•  provide  domain-specific  built-in  types  and  operations 
for storing and querying biomolecular sequences,

•  provide  built-in  support  for  tracking  laboratory  work 
flows,  and  admit  further  extensions  for  other 
special-purpose  types, 

•  allow  core  facilities  to  be  readily  extended  to  meet  the 
diverse needs of biological applications.

The core data manager is being constructed on top of
Sybase, Oracle, and Informix Universal Server. The software
is available free of charge and is freely
redistributable.
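
One way to picture a domain-specific operation layered on a
general-purpose database engine is a user-defined SQL function.
The sketch below is an illustration under stated assumptions, not
the project's design: it uses SQLite from the Python standard
library in place of the commercial systems named above, and the
table name `clone_seq` and function name `revcomp` are
hypothetical.

```python
import sqlite3

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def open_sequence_db():
    """Create an in-memory database with a sequence-aware SQL function."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so that queries can call revcomp(bases).
    conn.create_function("revcomp", 1, reverse_complement)
    conn.execute(
        "CREATE TABLE clone_seq (name TEXT PRIMARY KEY, bases TEXT NOT NULL)"
    )
    return conn


def find_on_either_strand(conn, motif):
    """Names of stored sequences containing the motif on either strand."""
    rows = conn.execute(
        "SELECT name FROM clone_seq "
        "WHERE instr(bases, ?) > 0 OR instr(revcomp(bases), ?) > 0 "
        "ORDER BY name",
        (motif, motif),
    )
    return [name for (name,) in rows]
```

Because the operation is registered with the engine, it can appear
inside ordinary queries rather than being applied to results
after retrieval.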

We  will  be  reporting  progress  on  the  core  data  manager's 
architecture  and  interface  at  the  URLs  above,  and  we  so- 
licit comments  on  its  design. 

DOE Grant No. DE-FG02-95ER62101.

†Originally called Database Management Research for the
Human  Genome  Project,  this  project  was  initiated  in  1995 
at  the  Massachusetts  Institute  of  Technology-Whitehead 
Institute. 

*Projects designated by an asterisk received small emergency grants following December 1992 site reviews by David Galas (formerly DOE Office of
Health and Environmental Research, which was renamed Office of Biological and Environmental Research in 1997), Raymond Gesteland (University
of Utah), and Elbert Branscomb (Lawrence Livermore National Laboratory).


DOE Human Genome Program Report, Part 2, 1996 Research Abstracts




Informatics 

A  Software  Environment  for  Large- 
Scale  Sequencing 

Mark  Graves 

Department  of  Cell  Biology;  Baylor  College  of  Medicine; 

Houston,  TX  77030 

713/798-8271,  Fax:  -3759;  mgraves@bcm.tmc.edu 

http://www.bcm.tmc.edu

http://stork.bcm.tmc.edu/gfp

Our  approach  is  to  implement  software  systems  which 
manage  primary  laboratory  sequence  data  and  explore  and 
annotate  functional  information  in  genome  sequence  and 
gene  products. 

Three  software  systems  have  been  developed  and  are  be- 
ing used:  two  sequence  data  managers  which  use  different 
sequence  assembly  packages,  FAK  and  Phrap,  and  a  series 
of  analysis  and  annotation  tools  which  are  available  via  the 
Internet.  In  addition,  we  have  developed  a  prototype  appli- 
cation for  data  mining  of  sequence  data  as  it  is  related  to 
metabolic  pathways. 

Products  of  this  project  are  the  following: 

1. GRM - a sequence reconstruction manager using the
FAK assembly engine (available since October 1995).

2. GFP - a sequence finishing support tool using the Phrap
assembly  engine  (available  since  March  1996). 

3.  A  series  of  gene  recognition  tools  (available  since  early 
1996). 

4. A tool for visualizing metabolic pathways data and
exploring sequence data related to metabolic pathways
(prototype available since August 1996).

DOE  Grant  No.  DE-FG03-94ER61618. 


Generalized  Hidden  Markov  Models 
for  Genomic  Sequence  Analysis 

David Haussler, Kevin Karplus,¹ and Richard Hughey¹

Computer Science Department and ¹Computer Engineering

Department; University of California; Santa Cruz, CA

95064

408/459-2105, Fax: -4829, haussler@cse.ucsc.edu

http://www.cse.ucsc.edu/research/compbio

http://www-hgc.lbl.gov/projects/genie.html 

We  have  developed  an  integrated  probabilistic  method  for 
locating  genes  in  human  DNA  based  on  a  generalized  hid- 
den Markov  model  (HMM).  Each  state  of  a  generalized 
HMM  represents  a  particular  kind  of  region  in  DNA,  such 
as  an  initial  exon  for  a  gene.  The  states  are  connected  by 
transitions that model sites in DNA between adjacent
regions, e.g., splice sites. In the full HMM, parametric
statistical models are estimated for each of the states and
transitions. Generalized HMMs allow a variety of choices for
these models, such as neural networks, high order Markov
models, etc. All that is required is that each model return a
likelihood for the kind of region or transition it is supposed
to  model.  These  likelihoods  are  then  combined  by  a  dy- 
namic programming  method  to  compute  the  most  likely 
annotation  for  a  given  DNA  contig.  Here  the  annotation 
simply  consists  of  the  locations  of  the  transitions  identified 
in  the  DNA,  and  the  labeling  of  the  regions  between  transi- 
tions with  their  corresponding  states. 
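
The dynamic programming step can be sketched as a Viterbi-style
search over segmentations. This is a toy illustration, not
Genie's implementation: the two states ("N" and "E"), their
base-composition models with a geometric length term, and the
`max_len` bound are all assumptions chosen to keep the example
small. Any function returning a log-likelihood for a segment
could be substituted for the toy models.

```python
import math


def iid_segment_model(base_probs, stay=0.9):
    """Toy segment model: geometric length term times i.i.d. base emissions."""
    log_p = {b: math.log(p) for b, p in base_probs.items()}

    def loglik(seq, start, end):
        length = end - start
        length_term = math.log(1.0 - stay) + (length - 1) * math.log(stay)
        return length_term + sum(log_p[b] for b in seq[start:end])

    return loglik


def ghmm_viterbi(seq, models, log_init, log_trans, max_len=50):
    """Most likely parse of seq into (start, end, state) segments."""
    n = len(seq)
    if n == 0:
        return []
    neg = -math.inf
    # best[j][s]: best log score of a parse of seq[:j] ending in state s.
    best = [{s: neg for s in models} for _ in range(n + 1)]
    back = [{s: None for s in models} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s, model in models.items():
            for i in range(max(0, j - max_len), j):
                seg = model(seq, i, j)
                if i == 0:
                    candidates = [(log_init.get(s, neg) + seg, None)]
                else:
                    candidates = [
                        (best[i][p] + log_trans.get((p, s), neg) + seg, p)
                        for p in models
                    ]
                for score, prev in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, prev)
    state = max(models, key=lambda s: best[n][s])
    if best[n][state] == neg:
        raise ValueError("no parse with nonzero likelihood")
    segments = []
    j = n
    while j > 0:
        i, prev = back[j][state]
        segments.append((i, j, state))
        j, state = i, prev
    segments.reverse()
    return segments


# Hypothetical two-state example: "N" favors A/T, "E" favors G/C.
TOY_MODELS = {
    "N": iid_segment_model({"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05}),
    "E": iid_segment_model({"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45}),
}
TOY_INIT = {"N": math.log(0.5), "E": math.log(0.5)}
TOY_TRANS = {("N", "E"): 0.0, ("E", "N"): 0.0}  # the two states alternate
```

The returned list of segments corresponds to the annotation
described above: the segment boundaries are the transitions, and
each segment carries the label of its state.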

This  method  has  been  implemented  in  the  genefinding  pro- 
gram Genie,  in  collaboration  with  Frank  Eeckman,  Martin 
Reese  and  Nomi  Harris  at  Lawrence  Berkeley  Labs.  David 
Kulp,  at  UCSC,  has  been  responsible  for  the  core  imple- 
mentation. Martin  Reese  developed  the  splice  site  models, 
promoter  models,  and  datasets.  You  can  access  Genie  at 
the  second  www  address  given  above,  submit  sequences, 
and  have  them  annotated.  Nomi  Harris  has  written  a  dis- 
play tool  called  Genotater  that  displays  Genie's  annotation 
along  with  the  annotation  of  other  genefinders,  as  well  as 
the  location  of  repetitive  DNA,  BLAST  hits  to  the  protein 
database,  and  other  useful  information.  Papers  and  further 
information  about  Genie  can  be  found  at  the  first  www 
address above. Since the ISMB '96 paper, Genie's exon
models have been extended to explicitly incorporate
BLAST and BLOCKS database hits into their probabilistic
framework.  This  results  in  a  substantial  increase  in  gene 
predicting  accuracy.  Experimental  results  in  tests  using  a 
standard  set  of  annotated  genes  showed  that  Genie  identi- 
fied 95%  of  coding  nucleotides  correctly  with  a  specificity 
of  88%,  and  76%  of  exons  were  identified  exactly. 
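
Accuracy figures of this kind can be computed as in the sketch
below. The abstract does not define its measures, so the
definitions here are assumptions following common gene-finding
usage: sensitivity is the fraction of true coding nucleotides
that were predicted as coding, and specificity is the fraction of
predicted coding nucleotides that are truly coding.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) from per-nucleotide coding flags."""
    if len(true_coding) != len(predicted_coding):
        raise ValueError("flag sequences must have equal length")
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if p and not t)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity


def exact_exon_fraction(true_exons, predicted_exons):
    """Fraction of true exons whose (start, end) was predicted exactly."""
    true_set = set(true_exons)
    if not true_set:
        raise ValueError("no true exons supplied")
    return len(true_set & set(predicted_exons)) / len(true_set)
```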

DOE Grant No. DE-FG03-95ER62112.

94306

415/326-5588  Fax:  ■2Q0\, jurka@gnomic.stanford.edu 

http://charon.lpi.org 

There  are  three  major  objectives  in  this  project:  organiza- 
tion of  databases  of  mammalian  repetitive  sequences, 
development  of  specialized  software  for  analysis  of  repeti- 
tive DNA,  and  sequence  studies  of  new  mammalian  re- 
peats. 

Our approach is based on extensive usage of computer
tools to investigate and organize publicly available
sequence information. We also pursue collaborative research
with experimental laboratories. The results are widely
disseminated via the internet, peer reviewed scientific
publications and personal interactions. Our most recent research
concentrates on mechanisms of retroposon integration in
mammals (Jurka, J., PNAS, in press; Jurka, J. and
Klonowski, P., J. Mol. Evol. 43:685-689).

We continue to develop reference collections of mammalian
repeats, which have become a worldwide resource for
annotation and study of newly sequenced DNA. The reference
collections  are  being  revised  annually  as  part  of  a  larger 
database  of  repetitive  DNA,  called  Repbase.  The  recent 
influx  of  sequence  data  to  public  databases  created  an  un- 
precedented need  for  automatic  annotation  of  known  re- 
petitive elements.  We  have  designed  and  implemented  a 
program  for  identification  and  elimination  of  repetitive 
DNA  known  as  CENSOR. 
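
The idea of identifying and eliminating known repeats can be
illustrated with a deliberately naive masker. This sketch is not
CENSOR's algorithm: it assumes exact matching of fixed-length
words against a small repeat library, whereas a practical tool
would rely on similarity searches that tolerate mismatches and
gaps. The function names and the word length `k` are
hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mask_repeats(sequence, repeat_library, k=8, mask_char="N"):
    """Mask every k-mer of sequence that occurs in a library repeat.

    repeat_library maps repeat names to sequences; both strands are used.
    """
    seq = sequence.upper()
    kmers = set()
    for repeat in repeat_library.values():
        forward = repeat.upper()
        for strand in (forward, reverse_complement(forward)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        # Test against the unmasked sequence so overlapping hits all count.
        if seq[i:i + k] in kmers:
            masked[i:i + k] = mask_char * k
    return "".join(masked)
```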

Reference  collections  of  mammalian  repeats  and  the  CEN- 
SOR program  are  available  electronically  (via  anonymous 
ftp  to  ncbi.nih.gov;  directory  repository/repbase).  CEN- 
SOR can  also  be  run  via  electronic  mail  (mail  "help"  mes- 
sage to  censor@charon.lpi.org). 

DOE Grant No. DE-FG03-95ER62139.

¹Gesellschaft für Biotechnologische Forschung;

Braunschweig, Germany

²Department of Genetics; University of Pennsylvania

School of Medicine; Philadelphia, PA 19104-6145

The database on transcription regulatory regions in
eukaryotic genomes (TRRD) has been developed [1]
(http://www.bionet.nsk.su/TRRD.html;
ftp://ftp.bionet.nsk.su/pub/trrd/). The main principle of data
representation in TRRD is the modular structure and hierarchy
of transcription regulatory regions. A TRRD entry corresponds
to a gene as an entire unit. Information on gene regulation is
provided (cell-cycle and cell type specificity, developmental
stage-specificity, influence of various molecular signals on
gene expression). The TRRD database contains information
about the structural organization of the gene transcription
regulatory region. TRRD contains descriptions of known
promoters and enhancers in 5', 3' regions and in introns. The
description of binding sites for transcription factors includes
the nucleotide sequence and precise location, the names of
factors that bind to the site, and the experimental evidence by
which the binding site was revealed. We provide
cross-references to the TRANSFAC database [2] for both sites
and factors as well as for genes. The TRRD 3.3 release
includes 340 vertebrate genes.

The Gene Expression Regulation Database (GERD)
collects information on features of gene expression as well
as information about gene transcription regulation. The
current release of GERD contains 75 entries with information
on expression regulation of genes expressed in
hematopoietic tissues in the course of ontogenesis and blood
cell differentiation. The COMPEL database contains
information about composite elements, which are functional
units essential for highly specific transcription regulation [3].
Direct interactions between transcription factors binding to
their target sites within composite elements result in
convergence of different signal transduction pathways.
Nucleotide sequences and positions of composite elements,
binding factors and types of their DNA binding domains, and
experimental evidence confirming synergistic or antagonistic
action of factors are registered in COMPEL.
Cross-references to the TRANSFAC factors table are given.
TRRD and COMPEL are cross-referenced to
each other. The COMPEL 2.1 release includes 140 composite
elements.

We have developed software for analysis of transcription
regulatory region structure. The CompSearch program is
based on the oligonucleotide weight matrix method. To collect
sets of binding sites for matrix construction we have
used the TRANSFAC and TRRD databases. The CompSearch
program takes into account the fine structure of
experimentally confirmed NFATp/AP-1 composite elements
collected in COMPEL (distances between binding sites in
composite elements, their mutual orientation). By means
of the program we have found potential composite
elements of the NFATp/AP-1 type in the regulatory regions of
various cytokine genes. Analysis of composite elements
could be the first approach to reveal specific patterns of
transcription signals encoding the regulatory potential of
eukaryotic promoters.
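
A weight matrix search for a two-site composite element can be
sketched as follows. This is not the CompSearch program: the
log-odds scoring with a pseudocount, the thresholds, and the
function names are assumptions, and the sketch considers only one
strand and one relative order of the two sites, whereas the text
notes that mutual orientation is also taken into account. Input
sequences are assumed to contain only A, C, G and T.

```python
import math


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a position weight matrix (log2 odds) from aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log2(counts[b] / total / background) for b in "ACGT"}
        )
    return matrix


def scan(seq, matrix, threshold):
    """Start positions where the matrix score reaches the threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(matrix[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append(i)
    return hits


def find_composite(seq, matrix_a, matrix_b, thr_a, thr_b, min_gap, max_gap):
    """Pairs (start_a, start_b) with site B min_gap..max_gap bases after A."""
    hits_b = set(scan(seq, matrix_b, thr_b))
    pairs = []
    for start_a in scan(seq, matrix_a, thr_a):
        end_a = start_a + len(matrix_a)
        for gap in range(min_gap, max_gap + 1):
            if end_a + gap in hits_b:
                pairs.append((start_a, end_a + gap))
    return pairs
```

Requiring both sites within a bounded spacing is what
distinguishes a composite element search from two independent
single-site scans.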
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http://gizmo.lbl.gov/opm.html 

The Object-Protocol Model (OPM) data management tools
provide facilities for efficiently constructing, maintaining,
and exploring molecular biology databases. Molecular
biology data are currently maintained in numerous molecular
biology databases (MBDs), including large archival MBDs
such  as  the  Genome  Database  (GDB)  at  Johns  Hopkins 
School  of  Medicine,  the  Genome  Sequence  Data  Base 
(GSDB)  at  the  National  Center  for  Genome  Resources, 
and  the  Protein  Data  Bank  (PDB)  at  Brookhaven  National 
Laboratory.  Constructing,  maintaining,  and  exploring 
MBDs  entail  complex  and  time-consuming  processes. 

The  goal  of  the  Object-Protocol  Model  (OPM)  data  man- 
agement tools  is  to  provide  facilities  for  efficiently  con- 
structing, maintaining,  and  exploring  MBDs,  using 
application-specific constructs on top of commercial
database management systems (DBMSs). The OPM tools will
also provide facilities for reorganizing MBDs and for
seamlessly exploring heterogeneous MBDs. The OPM tools
and documentation are available on the Web and are
developed in close collaboration with groups maintaining
MBDs, such as GDB, GSDB, and PDB.

Current  work  focuses  on  providing  new  facilities  for  con- 
structing and  exploring  MBDs.  The  specific  aims  of  this 
work are:

(1)  Extend  the  OPM  query  language  with  additional  con- 
structs for  expressing  complex  conditions,  and  enhance  the 
OPM  query  optimizer  for  generating  more  efficient  query 
plans. 

(2)  Develop  enhanced  OPM  query  interfaces  supporting 
MBD-specific  data  types  (e.g.,  protein  data  type)  and  op- 
erations (e.g.,  protein  data  display  and  3D  search),  and  as- 
sisting users  in  specifying  and  interpreting  query  results. 

(3)  Provide  support  for  customizing  MBD  interfaces. 

(4)  Extend  the  OPM  tools  with  facilities  for  managing  per- 
missions (object  ownership)  in  MBDs,  and  for  physical 
database  design  of  relational  MBDs,  including  specifica- 
tion of  indexes,  allocation  of  segments,  and  handling  of 
redundant  (denormalized)  data. 

(5)  Develop  OPM  tools  for  constructing  and  maintaining 
multiple  OPM  views  for  both  relational  and  non-relational 
(e.g., ASN.1, AceDB) MBDs. For a given MBD, these tools
will allow customizing different OPM views for different
groups of scientists. For heterogeneous MBDs, these tools will
allow exploring them using common OPM interfaces.

(6) Develop tools for constructing OPM-based
multidatabase  systems  of  heterogeneous  MBDs  and  for 
exploring  and  manipulating  data  in  these  MBDs  via  OPM 
interfaces.  As  part  of  this  effort,  the  OPM-based 
multidatabase  system  which  consists  currently  of  GDB  6.0 
and  GSDB  2.0,  will  be  extended  to  include  additional 
MBDs,  primarily  GSDB  2.2  (when  it  becomes  available), 
PDB,  and  Genbank. 

(7) Develop facilities for reorganizing OPM-based
MBDs. The database reorganization tools will support
automatic generation of procedures for reorganizing MBDs
following restructuring (revision) of MBD schemas.

In  the  past  year,  the  OPM  data  management  tools  have  been 
extended  in  order  to  address  specific  requirements  of  devel- 
oping MBDs  such  as  GDB  6  and  the  new  version  of  PDB. 

The  current  version  of  the  OPM  data  management  tools 
(4.1)  was  released  in  June  1996  for  Sun/OS,  Sun/Solaris 
and  SGI.  The  following  OPM  tools  are  available  on  the 
Web at http://gizmo.lbl.gov/opm.html:

(1) an editor for specifying OPM schemas;

(2) a translator of OPM schemas into relational database
specifications and procedures;

(3) utilities for publishing OPM schemas in text (LaTeX),
diagram (PostScript), and HTML formats;

(4)  a  translator  of  OPM  queries  into  SQL  queries; 

(5)  a  retrofitting  tool  for  constructing  OPM  schemas 
(views)  for  existing  relational  genomic  databases; 

(6)  a  tool  for  constructing  Web-based  form  interfaces  to 
MBDs  that  have  an  OPM  schema;  this  tool  was  developed 
by  Stan  Letovsky  at  Johns  Hopkins  School  of  Medicine,  as 
part  of  a  collaboration. 
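
The general idea behind translating an object-style schema into
relational definitions can be sketched in a few lines. The schema
notation below (a Python dictionary of classes and typed
attributes) is hypothetical and is not OPM syntax; the sketch
only shows how attributes that refer to other classes might
become foreign keys.

```python
# Assumed mapping from illustrative attribute types to SQL column types.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER", "real": "REAL"}


def class_to_ddl(class_name, attributes, known_classes):
    """Translate one class description into a CREATE TABLE statement."""
    columns = [f"{class_name}_id INTEGER PRIMARY KEY"]
    for attr, attr_type in attributes.items():
        if attr_type in SQL_TYPES:
            columns.append(f"{attr} {SQL_TYPES[attr_type]}")
        elif attr_type in known_classes:
            # An attribute whose type is another class becomes a foreign key.
            columns.append(
                f"{attr}_id INTEGER REFERENCES {attr_type}({attr_type}_id)"
            )
        else:
            raise ValueError(f"unknown type {attr_type!r} for {attr!r}")
    return f"CREATE TABLE {class_name} (" + ", ".join(columns) + ")"


def schema_to_ddl(schema):
    """Translate every class of the schema, in declaration order."""
    return [class_to_ddl(name, attrs, schema) for name, attrs in schema.items()]
```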

The OPM data management tools have been highly
successful in developing new genomic databases, such as
GDB 6 (released in January 1996;
http://gdbgeneral.gdb.org/gdb/) and the relational version of
PDB (http://terminator.pdb.bnl.gov:4148), and in constructing
OPM views and interfaces for existing genomic databases such
as GSDB 2.0. The OPM data management tools are
currently used by over ten groups in the USA and Europe. The
research underlying these tools is described in several
papers published in scientific journals and presented at
database and genome conferences.

In  the  past  year  the  OPM  tools  have  been  presented  at  da- 
tabase and  bioinformatics  conferences,  including  the  Inter- 
national Symposium  on  Theoretical  and  Computational 
Genome  Research,  Heidelberg,  Germany,  March  1996,  the 
Workshop  on  Structuring  Biological  Information,  Heidel- 
berg, Germany,  March  1996,  the  Meeting  on  Genome 
Mapping  and  Sequencing,  Cold  Spring  Harbor,  May  1996, 
the  International  Sybase  User  Group  Conference,  May 
1996,  the  Bioinformatics  -Structure  Conference,  Jerusa- 
lem, November  1996,  and  the  Pacific  Symposium  on 
Bioinformatics,  January  1997. 

The results of the research and development underlying
the OPM tools work have been presented in papers
published in proceedings of database and bioinformatics
conferences; these papers are available at
http://gizmo.lbl.gov/opm.html#Publications.

DOE Contract No. DE-AC03-76SF00098.

Genome Topographer (GT) is an advanced genome
informatics system that has received joint funding from
DOE and NIH over a number of years. DOE funding has
focused on GT tools supporting computational genome
analysis, principally on sequence analysis. GT is scheduled
for public release next spring under the auspices of the
Cold Spring Harbor Human Genome Informatics Research
Resource. GT has 17 major existing frameworks: 1. Views,
including printing, 2. Default manager, 3. Graphical User
Interface, 4. Query, 5. Project Manager, 6. Workspace
Manager, 7. Asynchronous Process Manager, 8. Study
Manager, 9. Help, 10. Application, 11. Notification, 12.
Security, 13. World Wide Web Interface, 14. NCBI, 15.
Reader, 16. Writer, 17. External Database Interface. GT
Frameworks are independent sets of VisualWorks (client)
or SmallTalkDB (GemStone) classes which interact to
perform the duties required to satisfy the responsibilities of
the specific framework. Each framework is clearly defined
and has a well-defined interface to use it. These
frameworks are used over and over in GT to perform similar
duties in different places. GT has basic tools and special
tools. Basic tools get used many times in different
applications, while special tools tend to be special purpose,
designed to do fairly limited things, although the distinction
is somewhat arbitrary. Tools typically use several
frameworks when they get assembled. Basic Tools: 1. Project
Browser, 2. Editor/Viewer, 3. Query, 4. NCBI Entrez, 5.
File reader/writer, 6. Map comparison, 7. Database
Administrator, 8. Login, 9. Default, 10. Help. Special Tools: 1.
Study Manager, 2. Compute Server, 3. Sequence Analysis,
4. Genetic Analysis. These frameworks and tools are
combined with a comprehensive database schema of very rich
biological expression linked with pluggable computational
tools.  Taken  together,  these  features  allow  users  to  con- 
struct, with  relative  ease,  on-line  databases  of  the  primary 
data  needed  to  study  a  genetic  disease  (or  genes  and  phe- 
notypes  in  general)  from  the  stage  of  family  collection  and 
diagnostic  ascertainment  through  cloning  and  functional 
analysis  of  candidate  genes,  including  mutational  analysis, 
expression  information,  and  screening  for  biochemical  in- 
teractions with  candidate  molecules.  GT  was  designed  on 
the  premise  that  a  highly  informative,  visual  presentation 
of  comprehensive  data  to  a  knowledgeable  user  is  essential 
to their understanding. The advanced software engineering
techniques that are promoted by using relatively new
object-oriented products have allowed GT to become a highly
interactive and visually oriented system that allows the
user to concentrate on the problem rather than on the
computer. Using the rich data representational features
characteristic of this technology, the GT software enables users to
construct models of real-world, complex biological
phenomena. These unique features of GT are key to the thesis
that such a system will allow users to discover otherwise
intractable networks of interactions exhibited by complex
genetic diseases.

The VisualWorks development environment allows the
development of code that runs unchanged across all major
workstation and personal computers, including PCs,
Macintoshes and most Unix workstations.

DOE  Grant  No.  DE-FG02-91ER61190. 

A  Flexible  Sequence  Reconstructor  for 
Large-Scale  DNA  Sequencing:  A 
Customizable  Software  System  for 
Fragment  Assembly 

Gene  Myers  and  Susan  Larson 

Department of Computer Science; University of Arizona;

Tucson, AZ 85721

602/621-6612,  Fax:  A2A(s.  gene@cs.arizona.edu 

http://www.cs.arizona.edu/faktory

We have completed the design and begun construction of a
software environment in support of DNA sequencing
called the "FAKtory". The environment consists of (1) our
previously described software library, FAK, for the core
combinatorial problem of assembling fragments, (2) a
Tcl/Tk-based interface, and (3) a software suite supporting a
modest database of fragments and a processing pipeline
that includes clipping and vector prescreening modules. A
key feature of our system is that it is highly customizable:
the structure of the fragment database, the processing
pipeline, and the operation of each phase of the pipeline are
specifiable by the user. Such customization need only be
established once at a given location; subsequently, users
see a relatively simple system tailored to their needs.
Indeed, one may direct the system to input a raw dataset of,
say, ABI trace files, pass them through a customized
pipeline, and view the resulting assembly with two button
clicks.

The system is built on top of our FAK software library and
as a consequence one receives (a) high-sensitivity overlap
detection, (b) correct resolution of large high-fidelity
repeats, (c) near perfect multi-alignments, and (d) support of
constraints that must be satisfied by the resulting
assemblies. The FAKtory assumes a processing pipeline for
fragments that consists of an INPUT phase, any number and
sequence of CLIP, PRESCREEN, and TAG phases,
followed by an OVERLAP and then an ASSEMBLY phase.
The sequence of clip, prescreen, and tag phases is
customizable and every phase is controlled by a panel of
user-settable preferences, each of which permits setting the
phase's mode to AUTO, SUPERVISED, or MANUAL.
This setting determines the level of interaction required by
the user when the phase is run, ranging from none to
hands-on. Any diagnostic situations detected during
pipeline processing are organized into a log that permits one to
confirm, correct, or undo decisions that might have been
made automatically.
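
The pipeline shape described above lends itself to a simple
structural check. The sketch below is an illustration only: the
representation of a pipeline as a list of (phase, mode) pairs and
the function name are assumptions, not part of the FAKtory.

```python
MIDDLE_PHASES = {"CLIP", "PRESCREEN", "TAG"}
MODES = {"AUTO", "SUPERVISED", "MANUAL"}


def validate_pipeline(phases):
    """Check a list of (phase_name, mode) pairs against the pipeline shape.

    Shape: INPUT, then any number of CLIP/PRESCREEN/TAG phases in any
    order, then OVERLAP, then ASSEMBLY. Raises ValueError if violated.
    """
    names = [name for name, _ in phases]
    for name, mode in phases:
        if mode not in MODES:
            raise ValueError(f"phase {name}: unknown mode {mode!r}")
    if len(names) < 3 or names[0] != "INPUT":
        raise ValueError("pipeline must begin with INPUT")
    if names[-2:] != ["OVERLAP", "ASSEMBLY"]:
        raise ValueError("pipeline must end with OVERLAP, ASSEMBLY")
    for name in names[1:-2]:
        if name not in MIDDLE_PHASES:
            raise ValueError(f"phase {name} not allowed between INPUT and OVERLAP")
    return True
```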

The  customized  fragment  database  contains  fields  whose 
type  may  be  chosen  from  TIME,  TEXT,  NUMBER,  and 
WAVEFORM. One can associate default values for fields
unspecified on input and specify a controlled vocabulary
limiting the range of acceptable values for a given field (e.g.,
John, Joe, or Mary for the field Technician, and [1, 36] for
the field Lane). This database may be queried with
SQL-like  predicates  that  further  permit  approximate 
matching  over  text  fields.  Common  queries  and/or  sets  of 
fragments  selected  by  them  may  be  named  and  referred  to 
later  by  said  name.  The  pipeline  status  of  a  fragment  may 
be  part  of  a  query. 
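
Field defaults and restricted value sets of the kind described
above might be modeled as follows. This is a minimal sketch with
hypothetical names (`FieldSpec`, `resolve`); it covers only the
default, vocabulary, and range checks from the example in the
text and does not attempt the TIME or WAVEFORM types or the query
language.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence, Tuple


@dataclass
class FieldSpec:
    """Description of one customizable fragment-database field."""

    name: str
    kind: str  # e.g. "TEXT" or "NUMBER"; recorded but not enforced here
    default: Any = None
    vocabulary: Optional[Sequence[str]] = None
    value_range: Optional[Tuple[float, float]] = None

    def resolve(self, value=None):
        """Apply the default to a missing value, then validate it."""
        if value is None:
            value = self.default
        if value is None:
            raise ValueError(f"{self.name}: no value and no default")
        if self.vocabulary is not None and value not in self.vocabulary:
            raise ValueError(f"{self.name}: {value!r} not in vocabulary")
        if self.value_range is not None:
            low, high = self.value_range
            if not low <= value <= high:
                raise ValueError(f"{self.name}: {value!r} out of range")
        return value
```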

The  system  permits  one  to  maintain  a  collection  of  alterna- 
tive assemblies,  to  compare  them  to  see  how  they  are  dif- 
ferent, and  directly  manipulate  assemblies  in  a  fashion 
consistent  with  sequence  overlaps.  The  system  can  be  cus- 
tomized so  that  a  priori  constraints  reflecting  a  given  se- 
quencing protocol  (e.g.  double-barreled  or  transposon- 
mapped)  are  automatically  produced  according  to  the  syn- 
tax of  the  names  of  fragments  (e.g.  X.f  and  X.r  for  any  X 
are  mates  for  double-barreled  sequencing).  The  system 
presents  visualizations  of  the  constraints  applied  to  an  as- 
sembly, and  one  may  experiment  with  an  assembly  by  add- 
ing and/or  removing  constraints.  Finally,  one  may  edit  the 
multi-alignment  of  an  assembly  while  consulting  the  raw 
waveforms.  Special  attention  was  given  to  optimizing  the 
ergonomics  of  this  time-intensive  task. 
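
Deriving mate constraints from fragment names, as in the X.f and
X.r example above, can be sketched as below. The function name
and the default suffixes are assumptions for illustration; a site
would substitute whatever naming syntax its protocol uses.

```python
def mate_pairs(fragment_names, forward_suffix=".f", reverse_suffix=".r"):
    """Pair fragments named X.f and X.r as mates for any stem X.

    Returns a sorted list of (forward_name, reverse_name) tuples;
    fragments without a partner are left unconstrained.
    """
    names = set(fragment_names)
    pairs = []
    for name in sorted(names):
        if name.endswith(forward_suffix):
            stem = name[: -len(forward_suffix)]
            mate = stem + reverse_suffix
            if mate in names:
                pairs.append((name, mate))
    return pairs
```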

DOE Grant No. DE-FG03-94ER61911.

The  Role  of  Integrated  Software  and 
Databases  in  Genome  Sequence 
Interpretation  and  Metabolic 
Reconstruction 

Terry  Gaasterland,  Natalia  Maltsev,  Ross  Overbeek,  and 

Evgeni  Selkov 

Mathematics and Computer Science Division; Argonne

National  Laboratory;  Argonne,  IL  60439 

630/252-4171,  Fax:   $<)%(>,  gaasterl@mcs.anl.gov 

MAGPIE: http://www.mcs.anl.gov/home/gaasterl/

WIT: http://www.cme.msu.edu/WIT

As scientists successfully sequence complete genomes, the
issue of how to organize the large quantities of evolving
sequence data becomes paramount. Through our work in
comparative whole genome analysis (MAGPIE,
Gaasterland) and metabolic reconstruction algorithms
(WIT, Overbeek, Maltsev, and Selkov), we carry genome
interpretation beyond the identification of gene products to
customized views of an organism's functional properties.

MAGPIE is a system designed to reside locally at the site
of a genome project and actively carry out analysis of genome
sequence data as it is generated. DNA sequences
produced  in  a  sequencing  project  mature  through  a  series 
of  stages  that  each  require  different  analysis  activities. 
Even  after  DNA  has  been  assembled  into  contiguous  frag- 
ments and  eventually  into  a  single  genome,  it  must  be 
regularly  reanalyzed.  Any  new  data  in  public  sequence  da- 
tabases may  provide  clues  to  the  identity  of  genes.  Over  a 
year,  for  2  megabases  with  4-fold  coverage,  MAGPIE  will 
request  on  the  order  of  100,000  outputs  from  remote 
analysis  software,  manipulate  and  manage  the  output,  up- 
date the  current  analysis  of  the  sequence  data,  and  monitor 
the  project  sequence  data  for  changes  that  initiate  reanaly- 
sis. 

In collaboration with Canada's Institute for Marine Biosciences
and the Canadian Institute for Advanced Research,
MAGPIE is being used to maintain and study comparative
views of all open reading frames (ORFs) across
fully  sequenced  genomes  (currently  5),  nearly  completed 
genomes  (currently  2)  and  1  genome  in  progress 
(Sulfolobus  solfataricus).  Together,  these  genomes  repre- 
sent multiple  archaeal  and  bacterial  genomes  and  one  eu- 
karyotic  genome.  This  analysis  provides  the  necessary  data 
to assign phylogenetic classifications to each ORF (e.g.,
"AE" for archaeal and eukaryotic). These data in turn provide
the basis for validating and assessing functional annotations
according to phylogenetic neighborhood (e.g.,
selecting the eukaryotic form of a biochemical function
over a bacterial form for an "AE" ORF).
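
As an illustration only, the classification and selection steps might be sketched as follows in Python. The one-letter codes "A" and "E" follow the example in the text; the code "B" for bacterial, the function names, and the input format are assumptions and are not taken from MAGPIE.

```python
# Assumed one-letter codes; only "A" and "E" appear in the abstract.
DOMAIN_CODES = {"archaea": "A", "bacteria": "B", "eukaryota": "E"}


def phylo_label(hit_domains):
    """Summarize which domains of life an ORF has similarity hits in."""
    codes = {DOMAIN_CODES[d] for d in hit_domains if d in DOMAIN_CODES}
    return "".join(sorted(codes))


def preferred_annotation(label, annotation_by_code):
    """Return the first candidate annotation whose domain is in the label."""
    for code in label:
        if code in annotation_by_code:
            return annotation_by_code[code]
    return None


label = phylo_label(["eukaryota", "archaea", "archaea"])
candidates = {"B": "bacterial form", "E": "eukaryotic form"}
print(label, preferred_annotation(label, candidates))
```

For an ORF labeled "AE", the bacterial candidate is passed over in favor of the eukaryotic one, matching the example given in the text.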

Once an automated functional overview has been established,
it remains to pinpoint the organisms' exact metabolic
pathways and establish how they interact. To this end,
the WIT (What Is There) system supports efforts to develop
metabolic reconstructions. Such reconstructions, or
models, are based on sequence data, clearly established
biochemistry of specific organisms, and an understanding of the
interdependencies of biochemical mechanisms. WIT thus
offers a valuable tool for testing current hypotheses about
microbial behavior. For example, a reconstruction may
begin  with  a  set  of  established  enzymes  (enzymes  with 
strong  similarities  in  identified  coding  regions  to  existing 
sequences  for  which  the  enzymatic  function  is  known)  and 
putative  enzymes  (enzymes  with  weak  similarity  to  se- 
quences of  known  function).  From  these  initial  "hits," 
within  a  phylogenetic  perspective,  we  identify  an  initial  set 
of  pathways.  This  set  can  be  used  to  generate  a  set  of  ex- 
pected enzymes  (enzymes  that  have  not  been  clearly  de- 
tected, but  that  would  be  expected  given  the  set  of  hypoth- 
esized pathways)  and  missing  enzymes  (enzymes  that  oc- 
cur in  the  pathways  but  for  which  no  sequence  has  yet 
been  biochemically  identified  for  any  organism).  Further 
reasoning identifies tentative connective pathways.
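
The four enzyme categories described above can be expressed as simple set operations. The following Python sketch is illustrative and is not WIT's implementation: the similarity-score thresholds, the rule that a pathway is asserted when at least half of its enzymes are detected, and the labels E1..E6 and P1, P2 are all assumptions made for the example.

```python
def reconstruct(hit_scores, pathways, sequenced_somewhere,
                strong=200.0, weak=50.0, min_fraction=0.5):
    """Classify enzymes as established, putative, expected, or missing."""
    # Strong hits are established enzymes; weaker hits are putative.
    established = {e for e, s in hit_scores.items() if s >= strong}
    putative = {e for e, s in hit_scores.items() if weak <= s < strong}
    detected = established | putative

    # Assumed rule: assert a pathway when enough of its enzymes were detected.
    asserted = {
        name for name, enzymes in pathways.items()
        if len(enzymes & detected) / len(enzymes) >= min_fraction
    }

    # Enzymes the asserted pathways need but that were not detected.
    needed = set().union(*(pathways[p] for p in asserted)) - detected
    return {
        "established": established,
        "putative": putative,
        "pathways": asserted,
        "expected": needed & sequenced_somewhere,  # sequence known elsewhere
        "missing": needed - sequenced_somewhere,   # no sequence in any organism
    }


result = reconstruct(
    hit_scores={"E1": 350.0, "E2": 80.0, "E5": 20.0},
    pathways={"P1": {"E1", "E2", "E3", "E4"}, "P2": {"E5", "E6"}},
    sequenced_somewhere={"E1", "E2", "E3", "E5"},
)
print(result)
```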


In  addition  to  helping  curators  develop  metabolic  recon- 
structions, WIT  lets  users  examine  models  curated  by  ex- 
perts, follow  connections  between  more  than  two  thousand 
metabolic  diagrams,  and  compare  models  (e.g.,  which  of 
certain  genes  that  are  conserved  among  bacterial  genomes 
are found in higher life). The objective is to set the stage
for  meaningful  simulations  of  microbial  behavior  and  thus 
to  advance  our  understanding  of  microbial  biochemistry 
and  genetics. 

DOE Contract No. W-31-109-Eng-38 (ANL FWP No.
60427).
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We  have  implemented  a  general-purpose  query  system, 
Kleisli,  that  provides  access  to  a  variety  of  "non-standard" 
data sources (e.g., ACeDB, ASN.1, BLAST), as well as to
"standard" relational databases. The system represents a
major advance in the ability to integrate the growing number
and diversity of biology data sources conveniently and
efficiently. It features a uniform query interface, the CPL
query language, across heterogeneous data sources; a
modular and extensible architecture; and, most significantly
for dealing with the Internet environment, a programmable
optimizer. We have demonstrated the utility of the system
in  composing  and  executing  queries  that  were  considered 
difficult,  if  not  unanswerable,  without  first  either  building 
a  monolithic  database  or  writing  highly  application- 
specific  integration  code  (details  and  examples  available  at 
URL  above). 

In  conjunction  with  other  software  developed  in  our  group, 
we have assembled a toolset that supports a range of data
integration  strategies  as  well  as  the  ability  to  create  spe- 
cialized data  warehouses  initialized  from  community  data- 
bases. Our  integration  strategy  is  based  upon  the  concept 
of  "mediators",  which  serve  a  group  of  related  applications 
by  providing  a  uniform  structural  interface  to  the  relevant 
data  sources.  This  approach  is  cost-effective  in  terms  of 
query  development  time  and  maintenance.  We  have  exam- 
ined in  detail  methods  for  optimizing  queries  such  as  "re- 
trieve all  known  human  sequence  containing  an  Alu  repeat 
in  an  intragenic  region"  where  the  data  sources  are  hetero- 
geneous and  distributed  across  the  Internet. 
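
The mediator idea can be illustrated with a small Python sketch. It is not Kleisli or CPL: the `Mediator` class, the two toy sources, their field names, and the accession numbers are hypothetical and serve only to show one uniform record structure placed in front of differently shaped sources.

```python
class Mediator:
    """Presents several differently shaped sources as one stream of records."""

    def __init__(self):
        self._adapters = []

    def register(self, fetch, to_record):
        # fetch() yields raw items; to_record() maps one raw item to a dict.
        self._adapters.append((fetch, to_record))

    def query(self, predicate):
        for fetch, to_record in self._adapters:
            for raw in fetch():
                record = to_record(raw)
                if predicate(record):
                    yield record


# Source 1: relational-style rows (accession, organism, has_alu).
table_rows = [("U00001", "human", True), ("U00002", "mouse", True)]
# Source 2: flat-file style "key=value" lines.
flat_lines = ["acc=X00003;org=human;alu=no", "acc=X00004;org=human;alu=yes"]


def parse_line(line):
    fields = dict(item.split("=") for item in line.split(";"))
    return {"accession": fields["acc"], "organism": fields["org"],
            "has_alu": fields["alu"] == "yes"}


mediator = Mediator()
mediator.register(lambda: table_rows,
                  lambda r: {"accession": r[0], "organism": r[1], "has_alu": r[2]})
mediator.register(lambda: flat_lines, parse_line)

hits = [r["accession"]
        for r in mediator.query(lambda r: r["organism"] == "human" and r["has_alu"])]
print(hits)
```

Applications written against the mediator see only the uniform record, so a change in one source's format is confined to its adapter.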

Transformation  of  data  resources,  that  is  the  structural  re- 
organization of  a  data  resource  from  one  form  to  another, 
arises  frequently  in  genome  informatics.  Examples  include 
the  creation  of  data  warehouses  and  database  evolution. 
Implementing  such  transformations  by  hand  on  a  case  by 
case  basis  is  time  consuming  and  error  prone.  Conse- 
quently there  is  a  need  for  a  method  of  specifying,  imple- 
menting and  formally  verifying  transformations  in  a  uni- 
form way  across  a  wide  variety  of  different  data  models. 
Morphase  is  a  prototype  system  for  specifying  transforma- 
tions between  data  sources  and  targets  in  an  intuitively  ap- 
pealing, declarative  language  based  on  Horn  clause  logic. 
Transformation specifications in Morphase are translated
into  CPL  and  executed  in  the  Kleisli  system.  The 
data-types  underlying  Morphase  include  arbitrarily  nested 
records, sets, variants, lists and object identity, thus capturing
the types common to most data formats relevant to genome
informatics, including ASN.1 and ACE. Morphase
can be connected to a wide variety of data sources simultaneously
through Kleisli. In this way, data can be read from
multiple  heterogeneous  data  sources,  transformed  using 
Morphase  according  to  the  desired  output  format,  and  in- 
serted into  the  target  data  source. 
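
To give a flavor of specifying a transformation declaratively rather than hand-coding it, here is a minimal Python sketch. It does not use Morphase's Horn-clause language or CPL; the rule format (target path, source path), the helper names, and the sample record are assumptions made for illustration.

```python
def get_path(record, path):
    """Follow a dotted path such as 'Clone.Map.size' into nested records."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def set_path(record, path, value):
    """Create nested records as needed and store value at the dotted path."""
    *parents, leaf = path.split(".")
    for key in parents:
        record = record.setdefault(key, {})
    record[leaf] = value


def transform(source, rules):
    """Apply (target_path, source_path) rules to build the target record."""
    target = {}
    for target_path, source_path in rules:
        set_path(target, target_path, get_path(source, source_path))
    return target


# The transformation is data, so it can be read and checked on its own.
RULES = [
    ("clone.name", "Clone.id"),
    ("clone.chromosome", "Clone.Map.chromosome"),
    ("clone.length_kb", "Clone.Map.size"),
]

ace_like = {"Clone": {"id": "cE22", "Map": {"chromosome": "22", "size": 40}}}
print(transform(ace_like, RULES))
```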

We  have  tested  Morphase  by  applying  it  to  a  variety  of 
different  transformation  problems  involving  Sybase,  ACE 
and ASN.1. For example, we used it to specify a transformation
between the Sanger Center's Chromosome 22 ACE
database  (ACE22DB)  and  a  Chromosome  22  Sybase  data- 
base (Chr22DB),  as  well  as  between  a  portion  of  GDB  and 
Chr22DB.  Some  of  these  transformations  had  already  been 
hand-coded  without  our  tools,  forming  a  basis  for  compari- 
son. 

Once  the  semantic  correspondences  between  objects  in  the 
various databases were understood, writing the transformation
program in Morphase was easy, even for a non-expert
user of the system. Furthermore, it was easy to find conceptual
errors  in  the  transformation  specification.  In  contrast,  the 
hand-coded  programs  were  obtuse,  difficult  to  understand, 
and  even  more  difficult  to  debug. 

DOE  Grant  No.  DE-FG02-94ER61923. 


Relevant  Publications 

P. Buneman, S.B. Davidson, K. Hart, C. Overton, and L. Wong, "A Data
Transformation System for Biological Data Sources," in Proceedings
of VLDB, Sept. 1995 (Zurich, Switzerland). Also available as
Technical Report MS-CIS-95-10, University of Pennsylvania, March
1995.

S.B. Davidson, C. Overton, and P. Buneman, "Challenges in Integrating
Biological Data Sources," J. Computational Biology 2 (1995), pp.
557-572.

A. Kosky, "Transforming Databases with Recursive Data Structures,"
PhD Thesis, December 1995.

S.B. Davidson and A. Kosky, "Effecting Database Transformations Using
Morphase," Technical Report MS-CIS-96-05, University of
Pennsylvania.

A. Kosky, S.B. Davidson, and P. Buneman, "Semantics of Database
Transformations," Technical Report MS-CIS-95-25, University of
Pennsylvania, 1995.

K. Hart and L. Wong, "Pruning Nested Data Values Using Branch
Expressions With Wildcards," in Abstracts of MIMBD, Cambridge,
England, July 1995.


Las Vegas Algorithm for Gene
Recognition:  Suboptimal  and 
Error-Tolerant  Spliced  Alignment 

Sing Hoi Sze and Pavel A. Pevzner¹

Departments of Computer Science and ¹Mathematics;
University of Southern California; Los Angeles, CA 90089
Pevzner: 213/740-2407, Fax: -2424
ppevzner@hto.usc.edu
http://www-hto.usc.edu/software/procrustes

Recently,  Gelfand,  Mironov,  and  Pevzner  (Proc.  Natl. 
Acad. Sci. USA, 1996, 9061-9066) proposed a spliced
alignment approach to gene recognition that provides 99%
accurate recognition of human genes if a related mammalian
protein is available. However, even 99% accurate gene
predictions  are  insufficient  for  automated  sequence  annota- 
tion in  large-scale  sequencing  projects  and  therefore  have 
to  be  complemented  by  experimental  gene  verification. 
100%  accurate  gene  predictions  would  lead  to  a  substantial 
reduction  of  experimental  work  on  gene  identification.  Our 
goal  is  to  develop  an  algorithm  that  either  predicts  an  exon 
assembly  with  accuracy  sufficient  for  sequence  annotation 
or  warns  a  biologist  that  the  accuracy  of  a  prediction  is 
insufficient  and  further  experimental  work  is  required.  We 
study  suboptimal  and  error-tolerant  spliced  alignment 
problems  as  the  first  steps  towards  such  an  algorithm,  and 
report  an  algorithm  which  provides  100%  accurate  recog- 
nition of  human  genes  in  37%  of  cases  (if  a  related  mam- 
malian protein  is  available).  For  52%  of  genes,  the  algo- 
rithm predicts  at  least  one  exon  with  100%  accuracy. 

DOE  Grant  No.  DE-FG03-97ER62383. 
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Viewed  as  strings  of  symbols,  biological  macromolecules 
can  be  modelled  as  elements  of  formal  languages.  Genera- 
tive grammars  have  been  useful  in  molecular  biology  for 
purposes  of  syntactic  pattern  recognition,  for  example  in 
the author's work on the GenLang pattern matching system,
which is able to describe and detect patterns that are
probably  beyond  the  capability  of  a  regular  expression 
specification.  More  recently,  grammars  have  been  used  to 
capture  intramolecular  interactions  or  long-distance  depen- 
dencies between  residues,  such  as  those  arising  in  folded 
structures.  In  the  work  of  Haussler  and  colleagues,  for  ex- 
ample, stochastic  context-free  grammars  have  been  used  as 
a  framework  for  "learning"  folded  RNA  structures  such  as 
tRNAs,  capturing  both  primary  sequence  information  and 
secondary  structural  covariation.  Such  advances  make  the 
study  of  the  formal  status  of  the  language  of  biological 
macromolecules  highly  relevant,  and  in  particular  the  find- 
ing that  DNA  is  beyond  context-free  has  already  created 
challenges  in  algorithm  design. 

Moreover,  to  date,  such  methods  have  not  been  able  to 
capture  relationships  between  strings  in  a  collection,  such 
as  those  that  arise  via  intermolecular  interactions,  or  evolu- 
tionary relationships  implicit  in  alignments.  Recently  we 
have  attempted  to  remedy  this  by  showing  (1)  how  formal 
grammars  can  be  extended  to  describe  interacting  collec- 
tions of  molecules,  such  as  hybridization  products  and, 
potentially,  multimeric  or  physiological  protein  interac- 
tions, and  (2)  how  simple  automata  can  be  used  to  model 
evolutionary  relationships  in  such  a  way  that  complex 
model-based  alignment  algorithms  can  be  automatically 
generated  by  means  of  visual  programming.  These  results 
allow  for  a  useful  generalization  of  the  language-theoretic 
methods  now  applied  to  single  molecules. 

In  addition,  we  describe  a  new  software  package — 
bioWidget — for  the  rapid  development  and  deployment  of 
graphical  user  interfaces  (GUIs)  designed  for  the  scientific 
visualization of molecular, cellular and genomics information.
The overarching philosophy behind bioWidgets is
componentry:  that  is,  the  creation  of  adaptable,  reusable 
software,  deployed  in  modules  that  are  easily  incorporated 
in  a  variety  of  applications  and  in  such  a  way  as  to  pro- 
mote interaction  between  those  applications.  This  is  in 
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sharp  distinction  to  the  common  practice  of  developing 
dedicated  applications.  The  bioWidgets  project  addition- 
ally focuses  on  the  development  of  specific  applications 
based  on  bioWidget  componentry,  including  chromo- 
somes, maps,  and  nucleic  acid  and  peptide  sequences. 

The  current  set  of  bioWidgets  has  been  implemented  in 
Java  with  the  goal  in  mind  of  delivering  local  applications 
and distributed applets via Intranet/Internet environments
as  required.  The  immediate  focus  is  on  developing  inter- 
faces for  information  stored  in  distributed  heterogeneous 
databases such as GDB, GSDB, Entrez, and ACeDB. The
issues  we  are  addressing  are  database  access,  reflecting 
database  schemas  in  bioWidgets,  and  performance.  We  are 
also  directing  our  efforts  into  creating  a  consortium  of 
bioWidget  developers  and  end-users.  This  organization 
will  create  standards  for  and  encourage  the  development  of 
bioWidget  components.  Primary  participants  in  the  consor- 
tium include  Gerry  Rubin  (UC  Berkeley)  and  Nat 
Goodman  (Jackson  Labs). 

DOE  Grant  No.  DE-FG02-92ER61371. 
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Bayesian  estimates  for  sequence  similarity:  There  is  an 
inherent  relationship  between  the  process  of  pairwise  se- 
quence alignment  and  the  estimation  of  evolutionary  dis- 
tance. This  relationship  is  explored  and  made  explicit.  As- 
suming an  evolutionary  model  and  given  a  specific  pattern 
of  observed  base  mismatches,  the  relative  probabilities  of 
evolution  at  each  evolutionary  distance  are  computed  us- 
ing a  Bayesian  framework.  The  mean  or  the  median  of  this 
probability  distribution  provides  a  robust  estimate  of  the 
central  value.  Bayesian  estimates  of  the  evolutionary  dis- 
tance incorporate  arbitrary  prior  information  about  variable 
mutation rates both over time and along sequence position,
thus requiring only a weak form of the molecular-clock
hypothesis. 
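
A small numerical sketch may make the Bayesian estimate concrete. The abstract does not name its evolutionary model; the Python example below assumes the Jukes-Cantor model, under which two sites differ with probability p(d) = (3/4)(1 - exp(-4d/3)) at distance d, together with a flat prior on a finite grid. The function names, grid limits, and mismatch counts are illustrative assumptions.

```python
import math


def jc_mismatch_prob(d):
    """Jukes-Cantor probability that two aligned bases differ at distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def distance_posterior(n_sites, n_mismatch, max_d=3.0, steps=3000,
                       prior=lambda d: 1.0):
    """Posterior over distance on a grid; prior(d) must be positive."""
    grid = [(i + 0.5) * max_d / steps for i in range(steps)]
    log_w = []
    for d in grid:
        p = jc_mismatch_prob(d)
        log_w.append(n_mismatch * math.log(p)
                     + (n_sites - n_mismatch) * math.log(1.0 - p)
                     + math.log(prior(d)))
    top = max(log_w)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]


def mean_and_median(grid, post):
    """Central values of the gridded posterior distribution."""
    mean = sum(d * w for d, w in zip(grid, post))
    cumulative, median = 0.0, grid[-1]
    for d, w in zip(grid, post):
        cumulative += w
        if cumulative >= 0.5:
            median = d
            break
    return mean, median


grid, post = distance_posterior(n_sites=100, n_mismatch=10)
print(mean_and_median(grid, post))
```

Passing a non-uniform `prior` is one way to encode prior information about mutation rates, in the spirit of the paragraph above.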

The  endpoints  of  the  similarity  between  genomic  DNA 
sequences  are  often  ambiguous.  The  probability  of  evolu- 
tion at  each  evolutionary  distance  can  be  estimated  over 
the  entire  set  of  alignments  by  choosing  the  best  alignment 
at  each  distance  and  the  corresponding  probability  of  du- 
plication at  that  evolutionary  distance.  A  central  value  of 
this  distribution  provides  a  robust  evolutionary  distance 
estimate.  We  provide  an  efficient  algorithm  for  computing 
the  parametric  alignment,  considering  evolutionary  dis- 
tance as  the  only  parameter. 

These  techniques  and  estimates  are  used  to  infer  the  dupli- 
cation history  of  the  genomic  sequence  in  C.  elegans  and 
in S. cerevisiae. Our results indicate that repeats discovered
using  a  single  scoring  matrix  show  a  considerable  bias  in 
subsequent  evolutionary  distance  estimates. 

Model based sequence scoring metrics: The PAM-based
DNA comparison metric has been extended to incorporate
biases in nucleotide composition and mutation rates,
extending earlier work (States, Gish and Altschul, 1993). A
codon-based scoring system has been developed that incorporates
the effects of biased codon utilization frequencies.

A  dynamic  programming  algorithm  has  been  developed 
that  will  optimally  align  sequences  using  a  choice  of  com- 
parison measures  (non-coding  vs.  coding,  etc.).  We  are  in 
the  process  of  evaluating  this  approach  as  a  means  for 
identifying  likely  coding  regions  in  cDNA  sequences. 

Efficient sequence similarity search tools: Most sequence
search tools have been designed for use with protein
sequence queries a few hundred residues long. The
analysis  of  genomic  DNA  sequence  necessitates  the  use  of 
queries  hundreds  of  kilobases  or  even  megabases  in  length. 
A  memory  and  computationally  efficient  search  tool  has 
been  developed  for  the  identification  of  repeats  and  se- 
quence similarity  in  very  large  segments  of  nucleic  acid 
sequence.  The  tool  implements  optimal  encoding  of  the 
word  table,  repeat  filters,  flexible  scoring  systems,  and 
analytically  parameterized  search  sensitivity.  Output  for- 
mats are  designed  for  the  presentation  of  genomic  se- 
quence searches. 
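
The word-table approach can be sketched in a few lines of Python. This is a generic k-mer index and not the tool described above: the word length, the `max_occurrences` repeat filter, and the function names are assumptions, and the compact table encoding mentioned in the abstract is not reproduced.

```python
from collections import defaultdict


def word_table(seq, k):
    """Map every length-k word to the positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def repeated_words(seq, k, max_occurrences=None):
    """Words seen more than once, optionally dropping very frequent ones."""
    return {
        word: positions
        for word, positions in word_table(seq, k).items()
        if len(positions) > 1
        and (max_occurrences is None or len(positions) <= max_occurrences)
    }


sequence = "ACGTACGTTTACGT"
print(repeated_words(sequence, 4))
print(repeated_words(sequence, 4, max_occurrences=2))
```

Shared words found this way serve as seeds; a full search tool would then extend and score the matches.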

Federated  databases:  A  Sybase  server  and  mirror  for 
GSDB  are  being  developed  to  facilitate  the  annotation  of 
repeat  sequence  elements  in  public  data  repositories. 

DOE  Grant  No.  DE-FG02-94ER61910. 
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GRAIL  is  a  modular  expert  system  for  the  analysis  and 
characterization  of  DNA  sequences  which  facilitates  the 
recognition  of  gene  features  and  gene  modeling.  A  new 
version  of  the  system  has  been  created  with  greater  sensi- 
tivity for  exon  prediction  (especially  in  AT  rich  regions), 
more  accurate  splice  site  prediction,  and  robust  indel  error 
detection  capability.  GRAIL  1.3  is  available  to  the  user  in 
a  Motif  graphical  client-server  system  (XGRAIL),  through 
WWW-Netscape,  by  e-mail  server,  or  callable  from  other 
analysis  programs  using  Unix  sockets. 

In  addition  to  the  positions  of  protein  coding  regions  and 
gene  models,  the  user  can  view  the  positions  of  a  number 
of  other  features  including  poly-A  addition  sites,  potential 
Pol  II  promoters,  CpG  islands  and  both  complex  and 
simple  repetitive  DNA  elements  using  algorithms  devel- 
oped at  ORNL.  XGRAIL  also  has  a  direct  link  to  the 
genQuest  server,  allowing  characterization  of  newly  ob- 
tained sequences  by  homology-based  methods  using  a 
number  of  protein,  DNA,  and  motif  databases  and  com- 
parison methods  such  as  FastA,  BLAST,  parallel 
Smith-Waterman,  and  special  algorithms  which  consider 
potential  frameshifts  during  sequence  comparison. 

Following  an  analysis  session,  the  user  can  use  an  annota- 
tion tool  which  is  part  of  the  XGRAIL  1.3  system  to  gener- 
ate a  "feature  table"  report  describing  the  current  sequence 
and  its  properties.  Links  to  the  GSDB  sequence  database 
have  been  established  to  record  computer-based  analysis 
of  sequences  during  submission  to  the  database  or  as  third 
party  annotation. 

Gene  Modeling  and  Client-Server  GRAIL:  In  addition 
to  the  current  coding  region  recognition  capabilities  based 
on  a  multiple  sensor-neural  network  and  rule  base,  mod- 
ules for  the  recognition  of  features  such  as  splice  junc- 
tions, transcription  and  translation  start  and  stop,  and  other 
control  regions  have  been  constructed  and  incorporated 
into  an  expert  system  (GAP  III)  for  reliable 
computer-based  modeling  of  genes.  Heuristic  methods  and 
dynamic programming are used to construct first-pass gene
models  which  include  the  potential  for  modification  of  ini- 
tially predicted  exons.  These  actions  result  in  a  net  im- 
provement in  gene  characterization,  particularly  in  the  rec- 
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ognition  of  very  short  coding  regions.  Translation  of  gene 
models  and  database  searches  are  also  supported  through 
access  to  the  genQuest  server  (described  below). 

Model  Organism  Systems:  A  number  of  model  organism 
systems  have  been  designed  and  implemented  and  can  be 
accessed  within  the  XGRAIL  1.3  client  including  Escheri- 
chia coli,  Drosophila  melanogaster  and  Arabidopsis 
thaliana.  The  performance  of  these  systems  is  basically 
equivalent  to  the  Human  GRAIL  1.3  system.  Additional 
model  organism  systems,  including  several  important  mi- 
croorganisms, are  in  progress. 

Error  Detection  in  Coding  Sequences:  Single-pass  DNA 
sequencing  is  becoming  a  widely  used  technique  for  gene 
identification  from  both  cDNA  and  genomic  DNA  se- 
quences. An  appreciably  higher  rate  of  base  insertion  and 
deletion  errors  (indels)  in  this  type  of  sequence  can  cause 
serious problems in the recognition of coding regions, homology
search, and other aspects of sequence interpretation.
We have developed two error detection and "correction"
strategies and systems which make low-redundancy
sequence data more informative for gene identification and
characterization purposes. The first algorithm detects sequencing
errors by finding changes in the statistically preferred
reading frame within a possible coding region and
then rectifies the frame at the transition point to make the
potential exon candidate frame-consistent. We have incorporated
this system in GRAIL 1.3 to provide analysis
which is very error tolerant. Currently the system can detect
about 70% of the indels at an indel rate of 1%, and
GRAIL identifies 89% of the coding nucleotides compared
to 69% for the system without error correction. The algorithm
uses dynamic programming and runs in time and
space linear in the size of the input sequence.
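
The abstract does not give the details of this algorithm, so the following Python sketch only illustrates the general idea: a Viterbi-style dynamic program that picks the preferred reading frame for each window of a sequence while charging a penalty for every frame change. The per-window frame scores, the penalty value, and the function names are assumptions, not GRAIL's.

```python
def best_frame_path(frame_scores, switch_penalty):
    """Choose a frame (0, 1 or 2) per window, penalizing frame changes."""
    if not frame_scores:
        return []
    best = list(frame_scores[0])
    back = []
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for f in range(3):
            # Best predecessor frame, paying the penalty if the frame changes.
            prev = max(range(3),
                       key=lambda g: best[g] - (switch_penalty if g != f else 0.0))
            cost = switch_penalty if prev != f else 0.0
            new_best.append(best[prev] - cost + scores[f])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back from the best final frame.
    frame = max(range(3), key=lambda f: best[f])
    path = [frame]
    for pointers in reversed(back):
        frame = pointers[frame]
        path.append(frame)
    path.reverse()
    return path


def frame_shifts(path):
    """Window indices where the chosen frame changes (candidate indels)."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


scores = [(5.0, 0.0, 0.0)] * 4 + [(0.0, 5.0, 0.0)] * 4
path = best_frame_path(scores, switch_penalty=3.0)
print(path, frame_shifts(path))
```

Each window is visited once and only three frames are considered, so time and memory grow linearly with the number of windows, consistent with the complexity stated above.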

In  the  second  method,  a  Smith-Waterman  type  comparison 
is  facilitated  in  which  the  frame  of  DNA  translation  to  pro- 
tein sequence  can  change  within  the  sequence.  The  transi- 
tion points  in  the  translation  frame  are  determined  during 
the  comparison  process  and  a  best  match  to  potential  pro- 
tein homologs  is  obtained  with  sections  of  translations 
from  more  than  one  frame.  The  algorithm  can  detect  ho- 
mologies with  a  sensitivity  equivalent  to  Smith-Waterman 
in  the  presence  of  5%  indel  errors. 

Detection  of  Regulatory  Regions:  An  initial  Polymerase 
II  promoter  detection  system  has  been  implemented  which 
combines  individual  detectors  for  TATA,  CAAT,  GC,  cap, 
and  translation  start  elements  and  distance  information  us- 
ing a  neural  network.  This  system  finds  about  67%  of 
TATA  containing  promoters  with  a  false  positive  rate  of 
one per 35 kilobases. Additionally, systems to detect potential
polyA addition sites and CpG islands have been incorporated
into GRAIL.

The GenQuest Sequence Comparison Server: The
genQuest server is an integrated sequence comparison
server which can be accessed via e-mail, using Unix sockets
from other applications, Netscape, and through a Motif
graphical  client-server  system.  The  basic  purpose  of  the 
server  system  is  to  facilitate  rapid  and  sensitive  compari- 
son of  DNA  and  protein  sequences  to  existing  DNA,  pro- 
tein, and  motif  databases.  Databases  accessed  by  this  sys- 
tem include  the  daily  updated  GSDB  DNA  sequence  data- 
base, SwissProt,  the  dbEST  expressed  sequence  tag  data- 
base, protein  motif  libraries  and  motif  analysis  systems 
(Prosite, BLOCKS), a repetitive DNA library (from J.
Jurka),  Genpept,  and  sequences  in  the  PDB  protein  struc- 
tural database.  These  options  can  also  be  accessed  from  the 
XGRAIL  graphical  client  tool. 

The  genQuest  server  supports  a  variety  of  sequence  query 
types.  For  searching  protein  databases,  queries  may  be  sent 
as  amino  acid  or  DNA  sequence.  DNA  sequence  can  be 
translated  in  a  user  specified  frame  or  in  all  6  frames. 
DNA-DNA  searches  are  also  supported.  User  selectable 
methods  for  comparison  include  the  Smith-Waterman  dy- 
namic programming  algorithm,  FastA,  versions  of  BLAST, 
and  the  IBM  dFLASH  protein  sequence  comparison  algo- 
rithm. A  variety  of  options  for  search  can  be  specified  in- 
cluding gap  penalties  and  option  switches  for 
Smith-Waterman, FastA, and BLAST, the number of alignments
and scores to be reported, desired target databases
for  query,  choice  of  PAM  and  Blosum  matrices,  and  an 
option  for  masking  out  repetitive  elements.  Multiple  target 
databases  can  be  accessed  within  a  single  query. 
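
Translating a DNA query in all six frames, as mentioned above, can be sketched as follows. This Python example uses the standard genetic code and is not genQuest code; the function names and the frame labels (+1 to +3 for the forward strand, -1 to -3 for the reverse complement) are assumptions.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate from offset `frame`; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translations of the three forward and three reverse-strand frames."""
    reverse = dna.upper().translate(COMPLEMENT)[::-1]
    frames = {f"+{f + 1}": translate(dna, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse, f) for f in range(3)})
    return frames


print(six_frames("ATGGCCTAA"))
```

Stop codons appear as `*` in the output, so a frame with a long stretch free of `*` is a candidate coding frame.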

Additional Interfaces and Access: Batch GRAIL 1.3 is a
new "batch" GRAIL client that allows users to analyze groups
of  short  (300-400  bp)  sequences  for  coding  character  and 
automates  a  wide  choice  of  database  searches  for  homol- 
ogy and  motifs.  A  Command  Line  Sockets  Client  has  been 
constructed  which  allows  remote  programs  to  call  all  the 
basic  analysis  services  provided  by  the  GRAIL-genQuest 
system  without  the  need  to  use  the  XGRAIL  interface. 
This  allows  convenient  integration  of  selected  GRAIL 
analyses  into  automated  analysis  pipelines  being  con- 
structed at  some  genome  centers.  An  XGRAIL  Motif 
Graphical  Client  for  the  GRAIL  release  1.3  has  been  con- 
structed using  Motif  with  versions  for  a  wide  variety  of 
UNIX platforms including Sun, Dec, and SGI. The e-mail
version of GRAIL can be accessed at grail@ornl.gov and
the e-mail version of genQuest can be accessed at
Q@ornl.gov. Instructions can be obtained by sending the
word "help" to either address. The Motif or Sun versions
of XGRAIL, batch GRAIL, and XgenQuest client software
are available by anonymous ftp from grailsrv.lsd.ornl.gov
(124.167.140.21). Both GRAIL and genQuest are accessible
over the World Wide Web (URL http://compbio.ornl.gov).
Communications with the GRAIL staff should be addressed
to GRAILMAIL@ornl.gov.

DOE Contract No. DE-AC05-84OR21400.
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The  purpose  of  this  project  is  to  develop  databases  and 
tools  for  the  Oak  Ridge  National  Laboratory  (ORNL) 
Mouse-Human  Mapping  Project,  including  the  construc- 
tion of  a  mapping  database  for  the  project;  tools  for  man- 
aging and  archiving  cDNAs  and  other  probes  used  in  the 
laboratory;  and  analysis  tools  for  mapping,  interspecific 
backcross,  and  other  needs.  Our  initial  effort  involved  in- 
stalling and  developing  a  relational  SYBASE  database  for 
tracking  samples  and  probes,  experimental  results,  and 
analyses.  Recent  work  has  focused  on  a  corresponding 
ACeDB  implementation  containing  mouse  mapping  data 
and  providing  numerous  graphical  views  of  this  data.  The 
initial  relational  database  was  constructed  with  SYBASE 
using  a  schema  modeled  on  one  implemented  at  the 
Lawrence  Livermore  National  Laboratory  (LLNL)  center; 
this  was  because  of  documentation  available  for  the  LLNL 
system  and  the  opportunity  to  maximize  compatibility  with 
human chromosome 19 mapping. (Major homologies exist
between  human  chromosome  19  and  mouse  chromosome 
7,  the  initial  focus  of  the  ORNL  work.) 

Our ACeDB implementation was modeled, with some modification,
on the Lawrence Berkeley National
Laboratory (LBNL) chromosome 21 ACeDB system and
designed  to  contain  genetic  and  physical  mouse  map  data 
as  well  as  homologous  human  chromosome  data.  The  use- 
fulness of  exchanging  map  information  with  LLNL  (hu- 
man chromosome  19)  and  potentially  with  other  centers 
has  led  to  the  implementation  of  procedures  for  data  export 
and  the  import  of  human  mapping  data  into  ORNL  data- 
bases. 

User  access  to  the  system  is  being  provided  by  workstation 
forms-based  data  entry  and  ACeDB  graphical  data  brows- 
ing. We  have  also  implemented  the  LLNL  database 
browser  to  view  human  chromosome  19  data  maintained  at 
LLNL,  and  arrangements  are  being  made  to  incorporate 
mouse  mapping  information  into  the  browser.  Other  appli- 
cations such  as  the  Encyclopedia  of  the  Mouse,  specific 
tools  for  archiving  and  tracking  cDNAs  and  other  mapping 
probes,  and  analysis  of  interspecific  backcross  data  and 
YAC  restriction  mapping  have  been  implemented. 

We  would  like  to  acknowledge  use  of  ideas  from  the 
LLNL  and  LBNL  Human  Genome  Centers. 

DOE Contract No. DE-AC05-84OR21400.
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Making  information  generated  by  the  various  genome 
projects  available  to  the  community  is  very  important  for 
the  researcher  submitting  data  and  for  the  overall  project  to 
justify  the  expenses  and  resources.  Public  genome  data- 
bases generally  provide  a  protocol  that  defines  the  required 
data  formats  and  details  how  they  accept  data,  e.g.,  se- 
quences, mapping  information.  These  protocols  have  to 
strike  a  balance  between  ease  of  use  for  the  user  and  op- 
erational considerations  of  the  database  provider,  but  are  in 
most  cases  rather  complex  and  subject  to  change  to  accom- 
modate modifications  in  the  database. 

SubmitData  is  a  user  interface  that  formats  data  for  sub- 
mission to  GSDB  or  GDB.  The  user  interface  serves  data 
entry  purposes,  checking  each  field  for  data  types,  allowed 
ranges  and  controlled  values,  and  gives  the  user  feedback 
on  any  problems.  Besides  one-time  submissions,  templates 
can  be  created  that  can  later  be  merged  with 
TAB-delimited  data  files,  e.g.,  as  produced  by  common 
spreadsheet  programs.  Variables  in  the  template  are  then 
replaced  by  values  in  defined  columns  of  the  input  data 
file.  Thus  submitting  large  amounts  of  related  data  be- 
comes as  easy  as  selecting  a  format  and  supplying  an  input 
filename.  This  allows  easy  integration  of  data  submission 
into  the  data  generation  process. 
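
As an illustration only, the template merge described above might be sketched in Python as follows. The abstract does not give SubmitData's template syntax, and the program itself was built with VisualWorks rather than Python, so the `$1`, `$2` column placeholders and the function name `merge_template` are assumptions made for this sketch.

```python
import csv
import io
import re

# Hypothetical placeholder syntax: "$1", "$2", ... name 1-based input columns.
PLACEHOLDER = re.compile(r"\$(\d+)")


def merge_template(template: str, tsv_text: str) -> list[str]:
    """Return one filled-in record per non-empty row of TAB-delimited input."""
    records = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # ignore blank lines in the input file

        def substitute(match, row=row):
            # Convert the 1-based column number in the template to a list index.
            index = int(match.group(1)) - 1
            if not 0 <= index < len(row):
                raise ValueError(
                    f"template refers to column {index + 1}, "
                    f"but the row has only {len(row)} columns"
                )
            return row[index]

        records.append(PLACEHOLDER.sub(substitute, template))
    return records
```

Calling `merge_template("locus=$1; chromosome=$2", text)` on the contents of a spreadsheet export yields one formatted record per data row, which is the sense in which a bulk submission reduces to choosing a format and naming an input file.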

The  interface  is  generated  directly  from  the  protocol  speci- 
fications. A  specific  parser/compiler  interprets  the  protocol 
definitions  and  creates  internal  objects  that  form  the  basis 
of  the  user  interface.  Thus  a  working  user  interface,  i.e., 
static  layout  of  buttons  and  fields,  data  validation,  is  auto- 
matically generated  from  the  protocol  definitions.  Protocol 
modifications  are  propagated  by  simply  regenerating  the 
interface. 
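
A minimal sketch of deriving field validation from protocol definitions is given below, in Python for brevity. The one-line-per-field notation (`name type constraint`) is invented for this example and is not the actual GSDB, GDB, or RHdb submission protocol; `FieldSpec`, `parse_protocol`, and `validate` are likewise hypothetical names, and the sketch covers only validation, not the generated layout of buttons and fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One field of a (hypothetical) submission protocol."""
    name: str
    kind: str                              # "int" or "str"
    minimum: Optional[int] = None          # allowed range for "int" fields
    maximum: Optional[int] = None
    allowed: Optional[frozenset] = None    # controlled values for "str" fields


def parse_protocol(text: str) -> list[FieldSpec]:
    """Parse lines such as 'chromosome int 1..22' or 'strand str +|-'."""
    specs = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if len(parts) < 2 or parts[1] not in ("int", "str"):
            raise ValueError(f"bad protocol line: {raw!r}")
        name, kind = parts[0], parts[1]
        constraint = parts[2] if len(parts) > 2 else None
        if kind == "int" and constraint:
            low, high = constraint.split("..")
            specs.append(FieldSpec(name, kind, int(low), int(high)))
        elif kind == "str" and constraint:
            specs.append(FieldSpec(name, kind, allowed=frozenset(constraint.split("|"))))
        else:
            specs.append(FieldSpec(name, kind))
    return specs


def validate(specs: list[FieldSpec], record: dict[str, str]) -> list[str]:
    """Check data type, allowed range and controlled values; return problems found."""
    problems = []
    for spec in specs:
        value = record.get(spec.name)
        if value is None or value == "":
            problems.append(f"{spec.name}: missing")
            continue
        if spec.kind == "int":
            try:
                number = int(value)
            except ValueError:
                problems.append(f"{spec.name}: not an integer")
                continue
            if spec.minimum is not None and not spec.minimum <= number <= spec.maximum:
                problems.append(f"{spec.name}: outside {spec.minimum}..{spec.maximum}")
        elif spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{spec.name}: not an allowed value")
    return problems
```

Because the checks are built from the parsed definitions rather than written by hand, changing the protocol text and parsing it again is enough to change what the validator accepts, which mirrors the regeneration step described above.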

The  program  has  been  developed  using  ParcPlace
VisualWorks  and  currently  supports  GSDB,  GDB  and
RHdb  data  submissions.  The  program  has  been  updated  to
use  VisualWorks  2.0.

DOE  Contract  No.  DE-AC03-76SF00098. 

From  April  through  September  1995,  the  Exploratorium
mounted  a  special  exhibition  called  Diving  into  the  Gene 
Pool  consisting  of  26  interactive  exhibits  developed  over 
the  course  of  three  years.  The  exhibits  introduce  the  science 
of  genetics  and  increase  public  awareness  of  the  Human 
Genome  Project  and  its  implications  for  society.  Founded 
in  the  success  of  exhibits  developed  for  the  1992  genetics 
and  biotechnology  symposium  "Winding  Your  Way  Through 
DNA"  (co-hosted  with  the  University  of  California,  San 
Francisco),  the  1995  exhibition  aimed  to  create  an  engag- 
ing and  accessible  presentation  of  specific  information 
about  genetic  science  and  our  understanding  of  the  struc- 
ture and  function  of  the  human  genome,  genetic  technol- 
ogy, and  ethical  issues  surrounding  current  genetic  science. 

In  addition  to  creating  a  unique  collection  of  exhibits,  the 
project  developed  a  range  of  supplemental  public  programming
to  provide  a  public  forum  for  discussion  and  interaction
about  genetics  and  bioethics.  A  lecture  series  entitled
"Bioethics  and  the  Human  Genome  Project"  featured  such
key  thinkers  as  Mary  Claire  King,  Leroy  Hood,  David 
Martin,  Troy  Duster,  Michael  Yesley,  William  Atchley,  and 
Joan  Hamilton  (among  others).  A  weekend  event  program 
focused  on  biodiversity  in  animal  and  plant  life  with 
events  such  as  "Seedy  Science,"  "Blooming  Genes,"  and 
"Dog  Diversity."  A  Biotech  Weekend  offered  access  to 
new  technologies  through  demonstrations  by  local  biotech 
firms  and  genetic  counselors.  And  a  specially-commissioned
theatre  piece,  "Dog  Tails,"  provided  an  instructive
and  comic  look  for  kids  into  the  foundations  of  genetics 
and  issues  of  diversity. 

In  the  5-month  exhibition  period,  approximately  300,000 
visitors  had  the  opportunity  to  visit  the  exhibition,  and 
well  over  5,000  participated  in  the  special  programming. 
Following  the  exhibition's  close,  the  new  exhibits  will  be- 
come a  permanent  part  of  the  Exploratorium's  collection 
of  over  650  interactive  exhibits. 

Additional  funding  for  1995-96  will  support  formal  outside 
evaluation  of  the  effectiveness  of  the  exhibits,  and  support 
exhibit  remediation  based  on  the  evaluation  findings.  This 
activity  will  both  strengthen  the  Exploratorium's  permanent 
collection  of  genetics  exhibits  and  help  to  develop  a  feasi- 
bility study  for  a  travelling  version  of  the  genetics  exhibi- 
tion for  other  museums  around  the  country  and  the  world. 

DOE  Grant  No.  DE-FG03-93ER61583. 


Documentary  Series  for  Public 
Broadcasting 

Graham  Chedd  and  Noel  Schwerin 

Chedd-Angier  Production  Company;  Watertown,  MA 

02172 

617/926-8300,  Fax:  -2710 

Designed  as  a  4-hour  documentary  series  for  Public
Broadcasting,  Genetics  in  Society  (working  title)  will  ex- 
plore the  ethical,  legal,  and  social  implications  of  genetic 
technology.  Currently  funded  and  in  production  for  a  90- 
minute  special  (Testing  Family  Ties),  the  first  program  pro- 
files several  individuals  and  families  as  they  confront  ge- 
netic tests  and  the  information  they  generate.  One  high- 
risk  cancer  family  struggles  to  make  sense  of  their  genetic 
legacy  as  it  debates  prophylactic  surgery  and  whether  or 
not  to  test  for  BRCA1  and  BRCA2.  In  a  family  without  that
family  risk,  news  of  the  Ashkenazi  BRCA1  finding  pushes
an  anxious  Jewish  woman  to  demand  testing  for  herself
and  her  young  daughter.  In  another,  a  woman  chooses  to
carry  to  term  her  prenatally  diagnosed  Cystic  Fibrosis 
twins,  despite  social  and  personal  pressures.  In  a  third,  a 
scientist  researching  the  so-called  "obesity  gene"  at  a 
biotech  company  debates  the  proper  "marketing"  of  his 
research  and  confronts  the  larger  questions  it  raises  about 
what  should  be  considered  "normal"  and  what  constitutes 
therapy  vs  enhancement. 

Testing  Family  Ties  will  explore  not  only  what  genetic 
technology  does — in  testing,  drug  development,  and  po- 
tential therapy — but  what  it  means  to  our  sense  of  self, 
family,  and  future  and  to  our  concepts  of  health  and  nor- 
mality. 

Depending  on  outstanding  funding  requests,  Genetics  in
Society  will  be  broadcast  in  the  Fall  of  1996  or  the  Winter
of  1997  on  PBS.  Noel  Schwerin  is  Producer/Director.  Graham
Chedd  is  Executive  Producer.

DOE  Grant  No.  DE-FG06-95ER61995.

Human  Genome  Teacher  Networking 
Project 

Debra  L.  Collins  and  R.  Nell  Schimke 

Genetics  Education  Center;  Division  of  Endocrinology  and 

Genetics;  University  of  Kansas  Medical  Center;  Kansas 

City,  KS  66160-7318 

913/588-6043,  Fax:  ^060,  collins@ukanvm.cc.ukans.edu 

http://www.kumc.edu/GEC 

This  project  links  over  150  middle  and  secondary  teachers 
from  throughout  the  United  States  with  genetic  and  public 
policy  professionals,  as  well  as  families  who  are  knowl- 
edgeable about  the  ethical,  legal,  and  social  implications 
(ELSI)  of  the  Human  Genome  Project.  Teachers  network
with  peers  and  professionals,  and  acquire  new  sources  of
information  during  four  phases:  1)  the  first  one-week  summer
workshop  to  update  teachers  on  human  genetics  concepts
and  new  sources  for  classroom  curricula  including
online  resources;  2)  classroom  use  of  new  materials  and 
information;  3)  the  second  one-week  summer  workshop 
where  teachers  return  to  exchange  successful  teaching 
ideas  and  plan  peer  teaching  sessions  and  mentor  network- 
ing; 4)  dissemination  of  genetic  information  through 
in-services  and  workshops  for  colleagues;  and  collaboration
with  genetic  professionals  participating  in  our  Mentor
Network. 

The  applications  of  Human  Genome  Project  technology 
are  emphasized.  Individuals  who  have  contact  and  experi- 
ence with  patients,  including  clinical  geneticists,  genetic 
counselors,  attorneys,  laboratory  geneticists  and  families,
take  part  in  didactic  sessions  with  teachers.  Throughout  the 
workshop,  family  panels  provide  an  opportunity  for  par- 
ticipants to  compare  their  textbook-based  knowledge  of 
genetic  conditions  with  the  personal  experiences  of  fami- 
lies who  discuss  their  condition,  including:  diagnosis, 
treatment,  genetic  risk,  decisions,  insurance,  employment, 
family  planning,  and  confidentiality. 

Because  of  this  project,  teachers  feel  more  prepared  and 
confident  teaching  about  human  genetics,  the  Human  Ge- 
nome Project  and  ELSI  topics.  The  teachers  are  effective 
in  disseminating  knowledge  of  genetics  to  their  students 
who  show  a  significant  increase  in  human  genome  knowledge
compared  to  students  whose  teachers  have  not  participated
in  this  project.

Teacher  dissemination  activities  extend  the  project  beyond 
participation  at  summer  workshops.  To  date,  55  workshop 
participants  have  completed  all  four  project  phases  by  or- 
ganizing more  than  200  local,  regional,  and  national 
teacher  education  programs  to  disseminate  knowledge  and 
resources.  More  than  1500  colleagues  and  the  general  pub- 
lic have  participated  in  teacher  workshops,  and  over 
56,000  students  have  been  reached  through  project  partici- 
pants and  their  peers. 

The  project  participants  organize  interdisciplinary  peer 
teaching  sessions  including  bioethical  decision  making 
sessions  combining  debate  and  biology  classes;  sessions 
for  social  studies  teachers;  human  genetics  and 
multi-cultural  collaborations;  cooperative  learning  activi- 
ties; and  curricular  development  sessions.  Students  were 
involved  in  sessions  on  ethics,  politics,  economics  and  law. 
Teachers  organize  bioethics  curriculum  writing  sessions, 
laboratory  activities  using  electrophoresis  as  well  as  other 
biotechnology,  and  sessions  on  genetic  databases. 

A  World  Wide  Web  home  page  for  Genetics  Education  as- 
sists teachers  in  remaining  current  on  genetic  information 
and  helps  them  find  answers  to  student  inquiries.  The
home  page  has  links  to  numerous  genome  sites,  sources  of
information  on  genetic  conditions,  networking  opportuni- 
ties with  other  genetics  education  programs,  teaching  re- 
sources, lesson  plan  ideas,  and  the  Mentor  Network  of  ge- 
netic professionals  and  a  network  of  family  support  groups 
willing  to  work  with  teachers  and  their  students. 

DOE  Grant  No.  DE-FG02-92ER61392.

Human  Genome  Education  Program 

Lane  Conn 

Human  Genome  Education  Program;  Stanford  Human 
Genome  Center;  Palo  Alto,  CA  94304 
415/812-2003,  Fax:  -1916,  tconn@toolik.stanford.edu

The  Human  Genome  Education  Program  (HGEP)  operates 
within  the  Stanford  Human  Genome  Center.  It  is  a  collaborative
effort  among  HGEP  staff,  Genome  Center  scientists,
collaborating  staff  from  other  education  programs,  experienced
high  school  teachers,  and  an  Advisory  Panel  in  the
fields  of  science,  education,  social  science,  assessment 
and  ethics. 

The  Human  Genome  Project  will  have  a  profound  impact 
on  society  with  its  applications  in  testing  for  and  improving
treatment  of  genetic  disease  and  the  many  uses  of
DNA  profiling.  The  goal  of  HGEP  is  to  help  prepare  high 
school  students  and  community  members  to  be  able  to 
make  educated  decisions  on  the  personal,  ethical,  social 
and  policy  questions  raised  by  the  application  of  genome 
information  and  technology  in  their  lives. 

The  primary  objectives  for  HGEP  are  to  (1)  develop  a  hu- 
man genome  curriculum  for  high  school  science  and  (2) 
education  outreach  to  schools  and  community  groups  in 
the  San  Francisco  Bay  Area.  To  achieve  Objective  1,  the 
HGEP  is  working  to  develop,  field  test  and  prepare  for 
national  dissemination  two  laboratory-based  curriculum
units  for  high  school  students.  Unit  1,  "Dealing  With  Ge- 
netic Disorders,"  explores  the  variety  of  treatment  options 
potentially  available  for  a  genetic  disorder,  including  gene 
therapy.  Unit  2,  "DNA  Snapshots,  Peeking  at  Your  DNA," 
explores  human  relatedness  through  examining  the
student's  own  DNA  polymorphisms  using  PCR. 

Each  unit  is  centered  around  a  societal  or  ethical  problem 
raised  by  these  important  applications  of  genome  informa- 
tion and  technology.  Students  use  modeling  exercises  and 
inquiry  laboratory  experiments  to  learn  about  the  science 
behind  a  given  application.  Students  then  combine  the  sci- 
ence they  have  learned  with  other  relevant  information  to 
choose  a  solution  to  the  societal/ethical  problem  posed  in 
the  unit.  As  a  culminating  activity,  the  students  work  in 
groups  to  present  and  defend  their  solution. 

To  achieve  Objective  2,  the  HGEP  provides  Genome  Center
tours  for  teacher,  student  and  community  groups  that
involve  pre-tour  lectures;  tour  exploration  of  genome  mapping,
sequencing  and  informatics;  and  post-tour  lecture
and  discussion  on  genome  applications,  and  their  social 
and  ethical  implications.  Also,  the  education  program  con- 
tinues to  work  to  establish  and  sustain  local  science  educa- 
tion partnerships  among  schools,  industry,  universities  and 
national  laboratories. 

DOE  Grant  No.  DE-FG03-96ER62161. 

Your  World/Our  World-Biotechnology  & 
You:  Special  Issue  on  the  Human 
Genome  Project 

Jeff  Davidson  and  Laurence  Weinberger

Pennsylvania  Biotechnology  Association;  State  College, 

PA  16801 

814/238-4080,  Fax:  -4081,  73150.1623@compuserve.com

Your  World/Our  World  is  a  biotechnology  science  maga- 
zine published  semi-annually  by  the  non-profit  Pennsylva- 
nia Biotechnology  Association  (PBA)  describing  for  sev- 
enth to  tenth  grade  students  the  excitement  and  achieve- 
ments of  contemporary  biotechnology.  This  is  the  only 
continuing  source  of  biotechnology  education  specifically 
directed  to  this  age  group  -  an  age  at  which  students  too 
frequently  are  turned  off  from  science.  The  special  Spring
1996  issue  will  be  devoted  to  the  presentation  of  the  sci- 
ence behind  the  HGP,  the  HGP  itself,  and  the  ethical,  legal, 
and  social  issues  generated  by  the  project.  The  strong  em- 
phasis on  attractive  graphic  presentation  and  age  appropri- 
ate text  that  have  been  the  hallmark  of  the  earlier  issues, 
which  have  been  highly  acclaimed  and  well  received  by 
the  educational,  scientific,  and  business  community,  will
be  continued. 

PBA  believes  that  increased  educational  opportunities  to 
learn  about  biotechnology  are  most  effective  if  presented 
at  the  seventh  to  tenth  grade  levels  for  the  following  rea- 
sons: 

•  Full  semester  life  science  and  biology  classes  often 
occur  for  the  first  time  in  these  grades; 

•  Across  the  nation,  textbooks  are  typically  10  to  14 
years  old,  and  even  the  most  recent  textbooks  are 
quickly  dated  by  the  rapid  development  in  the  biologi- 
cal sciences; 

•  Curricula  at  this  level  are  more  flexible  than  high 
school  curricula,  allowing  the  addition  of  information 
about  exciting  biological  developments;  and 

•  Science  at  this  level  is  generally  not  elective,  and, 
therefore,  a  very  comprehensive  student  population  is 
addressed  rather  than  the  more  selective  populations 
available  later  in  the  educational  program. 


In  creating  Your  World/Our  World,  the  PBA  defined  the 
following  educational  goals  to  guide  the  development  of 
the  magazine: 

•  Contribute  to  general  science  literacy  and  an  educated 
electorate; 

•  Contribute  to  biological  and  technological  literacy; 
and 

•  Motivate  students  to  pursue  additional  science  study 
and  careers  in  science,  particularly  among  women  and 
minority  populations. 

PBA  recognizes  that  it  has  been  a  point  of  pride  that 
biotechnologists  have  been  uniquely  concerned  with  the 
impact  of  their  technology  on  society  and  have  been  the 
first  to  raise  and  encourage  responsible  public  debate  with- 
out being  forced  to  do  so  by  others.  To  do  less  now  for  the 
children  would  be  a  breach  of  this  responsible  history.  Ac- 
cordingly, this  special  HGP  issue  will  address  the  ethical, 
legal,  and  social  issues  raised  by  the  new  genomic  tech- 
nologies. Special  ethics  advisors  have  been  recruited  to  aid 
in  the  development  of  these  aspects. 

A  complimentary  copy  of  the  special  issue  and  its  teachers' 
guide  will  be  mailed  to  every  public  and  private  school 
seventh  to  tenth  grade  science  teacher  (approximately 
40,000)  in  the  United  States.  A  cover  announcement  will 
explain  the  origin  and  development  of  the  magazine  and  of 
the  special  edition.  Teachers  will  be  invited  to  purchase
full  classroom  packets  (30  copies  &  teacher's  guide)  from 
the  PBA,  but,  if  they  are  not  able  to  afford  the  packets, 
they  will  be  asked  to  respond  by  postcard  indicating  their 
interest.  The  cost  of  the  packets  will  probably  be  in  the  $20
range.  The  PBA  is  actively  seeking  additional  support  so 
that  the  issue  may  be  distributed  for  free  or  at  a  reduced 
cost.  In  addition,  parts  of  the  special  issue  will  be  available 
over  the  Internet  via  a  World  Wide  Web  Page. 

PBA  believes  this  is  a  unique  opportunity  to  educate 
America's  youth  about  the  HGP  and  ensure  that  accurate
non-sensational  information  will  be  made  available  to  our 
country's  children. 

DOE  Grant  No.  DE-FG02-95ER62107.

The  Human  Genome  Project  and 
Mental  Retardation:  An  Educational 
Program 

Sharon  Davis 

Department  of  Research  and  Program  Services;  The  Arc 
of  the  United  States;  Arlington,  TX  76010 
817/261-6003,  Fax:  /277-3491,  sdavis@metronet.com
http://TheArc.org/welcome.html

The  Arc  of  the  United  States,  a  national  organization  on 
mental  retardation,  with  140,000  members  and  more  than 
1000  affiliated  chapters,  proposes  to  educate  its  general
membership  and  volunteer  leaders  about  the  Human  Genome
Project  as  it  relates  to  mental  retardation.  A  large
number  of  identified  causes  of  mental  retardation  are  ge- 
netic, and  many  family  members  of  The  Arc  deal  with  is- 
sues related  to  a  genetic  condition  on  a  daily  basis.  We  be- 
lieve it  is  critical  for  our  members  and  leaders  to  be  edu- 
cated about  the  scientific  and  ethical,  legal  and  social  as- 
pects of  the  HGP,  so  that  the  association  can  evaluate  and 
discuss  the  issues  and  develop  positions  based  on  adequate 
knowledge. 

The  major  objectives  of  the  proposed  three-year  project 
are  to  develop  and  disseminate  educational  materials  for 
members/leaders  of  The  Arc  to  inform  them  about  the  Hu- 
man Genome  Project  and  mental  retardation  and  to  con- 
duct training  on  the  scientific  and  ethical,  legal  and  social 
aspects  of  the  Human  Genome  Project  and  mental  retarda- 
tion using  The  Arc's  existing  training  vehicles. 

The  Arc  will  develop  and  disseminate  educational  materi- 
als oriented  toward  families  and  conduct  training  at  its  na- 
tional and  state  conventions,  local  chapter  meetings  and  at 
board  of  directors'  meetings.  The  American  Association  of
University  Affiliated  Programs  for  Persons  with  Develop- 
mental Disabilities  (AAUAP)  will  assist  with  the  project 
by  providing  needed  expertise.  The  AAUAP  membership 
includes  university  faculty  who  are  experts  on  the  genetic 
causes  of  mental  retardation  and  on  related  ethical,  legal 
and  social  issues.  An  advisory  panel  of  university  scientists 
and  leaders  of  The  Arc  will  guide  the  project. 

DOE  Grant  No.  DE-FG03-96ER62162.

Pathways  to  Genetic  Screening: 
Molecular  Genetics  Meets  the  High- 
Risk  Family 

Troy  Duster  and  Diane  Beeson¹

Institute  for  the  Study  of  Social  Change;  University  of

California;  Berkeley,  CA  94705

510/642-0813,  Fax:  /8674,  mtrogn@violet.berkeley.edu

¹Department  of  Sociology;  California  State  University;

Hayward,  CA  94542 

The  proliferation  of  genetic  screening  and  testing  is  requir- 
ing increasing  numbers  of  Americans  to  integrate  genetic 
knowledge  and  interventions  into  their  family  life  and  per- 
sonal experience.  This  study  examines  the  social  processes 
that  occur  as  families  at  risk  for  two  of  the  most  common 
autosomal  recessive  diseases,  sickle  cell  disease  (SC)  and 
cystic  fibrosis  (CF),  encounter  genetic  testing.  Since  each 
of  these  diseases  is  found  primarily  in  a  different  ethnic/ 
racial  group  (CF  in  European  Americans  and  SC  in  African
Americans),  this  research  will  clarify  the  role  of  culture  in 
integrating  genetic  testing  into  family  life  and  reproductive 
planning.  A  third  type  of  genetic  disorder,  the
thalassemias,  has  recently  been  added  to  our  sample  in  order
to  extend  our  comparative  frame  to  include  other  ethnic
and  racial  groups.  In  California,  the  thalassemias  primarily
affect  Southeast  Asian  immigrants,  although  another
risk  group  is  from  the  Mediterranean  region.
Thalassemias,  like  cystic  fibrosis  and  sickle  cell  disease, 
have  a  similar  pattern  of  inheritance  and  raise  similarly 
serious  bio-medical  challenges  and  issues  of  information 
management. 

Data  are  drawn  from  interviews  with  members  of  families 
in  which  a  gene  for  CF,  SC  or  thalassemia  has  been  identi- 
fied. Data  collection  consists  primarily  of  focused  inter- 
views with  approximately  400  individuals  from  families  in 
which  at  least  one  member  has  been  identified  as  having  a 
genetic  disorder  (or  trait).  In  the  most  recent  phase  of  the 
research,  we  are  conducting  focus  groups  selected  to 
achieve  stratified  homogeneity  around  key  social  dimen- 
sions such  as  gender  and  relationship  to  disease.  This  is 
clarifying  the  social  processes  that  facilitate  and  inhibit 
genetic  testing. 

We  are  currently  assessing  the  concerns  expressed  by  re- 
spondents about  the  potential  uses  of  genetic  information. 
We  find  strong  patterns  of  concern,  often  based  on  per- 
sonal experience,  that  genetic  information  may  be  used  in 
ways  that  family  members  perceive  as  dangerous  and/or 
discriminatory.  First  among  these  concerns  is  fear  of  losing 
access  to  health  care.  Additional  concerns  include  fear  of 
genetic  discrimination  in  employment  and  other  types  of 
insurance,  particularly  life  insurance.  Similar  patterns  of 
concern  exist  among  members  of  each  ethnic  group,  and 
are  frequently  the  focus  of  attention  among  family  mem- 
bers, but  take  somewhat  different  form  within  each  cul- 
tural group.  These  concerns  constitute  a  growing  obstacle 
to  widespread  use  of  genetic  testing. 

DOE  Grant  No.  DE-FG03-92ER61393. 


Intellectual  Property  Issues  in 
Genomics 

Rebecca  S.  Eisenberg 

University  of  Michigan  Law  School;  Ann  Arbor,  MI  48109
313/763-1372,  Fax:  -9375,  rse@umich.edu

Intellectual  property  issues  have  been  uncommonly  salient 
in  the  recent  history  of  advances  in  genomics.  Beginning 
with  the  filing  of  patent  applications  by  NIH  on  the  first
batch  of  expressed  sequence  tags  (ESTs)  from  the  labora- 
tory of  Dr.  Craig  Venter,  each  new  development  has  been 
met  with  speculation  about  its  strategic  significance  from 
an  intellectual  property  perspective.  Are  ESTs  of  unknown 
function  patentable,  or  is  further  work  necessary  before 
they  satisfy  patent  law  standards?  Will  patents  on  such 
fragments  promote  commercial  investment  in  product  development,
or  will  they  interfere  with  scientific  communication
and  collaboration  and  retard  the  overall  research
effort?  Without  patent  rights,  how  may  the  owners  of  pri- 
vate cDNA  sequence  databases  earn  a  return  on  their  in- 
vestment while  still  permitting  other  investigators  to  obtain 
access  to  the  information  on  reasonable  terms?  What  are 
the  rights  of  those  who  contribute  resources  such  as  cDNA 
libraries  that  are  used  to  create  the  databases,  and  of  those 
who  identify  sequences  of  interest  out  of  the  morass  of 
information  in  the  databases  by  formulating  appropriate 
queries?  Will  the  disclosure  of  ESTs  in  the  public  domain 
preclude  patenting  of  subsequently  characterized 
full-length  genes  and  gene  products?  And  why  would  a 
commercial  firm  invest  its  own  resources  in  generating  an 
EST  database  for  the  public  domain? 

Two  factors  have  contributed  to  the  fascination  with  intel- 
lectual property  in  this  setting.  First  is  a  perception  that 
some  pioneers  in  genomics  have  sought  to  claim  intellec- 
tual property  rights  that  reach  beyond  their  actual  achieve- 
ments to  cover  future  discoveries  yet  to  be  made  by  others. 
For  example,  the  controversial  NIH  patent  applications 
claimed  rights  not  only  in  the  ESTs  that  were  actually  set 
forth  in  the  specifications,  but  also  in  the  full-length 
cDNAs  that  might  be  obtained  by  using  the  ESTs  as 
probes,  as  well  as  in  other,  undisclosed  fragments  of  those 
genes.  More  recently,  private  owners  of  cDNA  sequence 
databases  have  set  as  a  condition  for  access  agreement  to 
offer  the  database  owners  licenses  to  any  resulting  intellec- 
tual property.  These  efforts  to  claim  rights  to  the  future 
discoveries  of  others  raise  issues  about  the  fairness  and 
efficiency  of  the  law  in  allocating  rewards  and  incentives 
along  the  path  of  cumulative  innovation. 

Second  is  the  counterintuitive  alignment  of  interests  in  the 
debate.  It  was  a  public  institution,  NIH,  that  initially  fa- 
vored patenting  discoveries  that  some  representatives  of 
industry  thought  should  remain  unpatented,  and  it  was  a 
major  pharmaceutical  firm,  Merck  &  Co.,  that  ultimately
took  upon  itself  the  quasi-governmental  function  of  spon- 
soring a  university-based  effort  to  place  comparable  infor- 
mation in  the  public  domain.  These  topsy-turvy  positions 
in  the  public  and  private  sectors  raise  intriguing  questions 
about  the  proper  roles  of  government  and  industry  in 
genomics  research,  and  about  who  stands  to  benefit  (and 
who  stands  to  lose)  from  the  private  appropriation  of  ge- 
nomic information. 

DOE  Grant  No.  DE-FG02-94ER61792. 


AAAS  Congressional  Fellowship 
Program 

Stephen  Goodman 

The  American  Society  of  Human  Genetics;  Bethesda,  MD 

20814-3998 

301/571-1825,  Fax:  /530-7079,  society@genetics.faseb.org 

Few  individuals  in  the  genetics  community  are  conversant 
with  federal  mechanisms  for  developing  and  implementing 
policy  on  human  genetics  research.  In  1995  the  American
Society  of  Human  Genetics  (ASHG),  in  conjunction  with
DOE,  initiated  an  American  Association  for  the  Advancement
of  Science  (AAAS)  Congressional  Fellowship  Program
to  strengthen  the  dialogue  between  the  professional
genetics  community  and  federal  policymakers.  The  fellow- 
ship will  allow  genetics  professionals  to  spend  a  year  as 
special  legislative  assistants  on  the  staff  of  members  of 
Congress  or  on  congressional  committees.  Directed  toward 
productive  scientists,  the  program  is  intended  to  attract 
independent  investigators. 

In  addition  to  educating  the  scientific  community  about  the 
public  policy  process,  the  fellowship  is  expected  to  dem- 
onstrate the  value  of  science-government  interactions  and 
make  practical  contributions  to  the  effective  use  of  scien- 
tific and  technical  knowledge  in  government.  The  program 
includes  an  orientation  to  legislative  and  executive  opera- 
tions and  a  year-long  weekly  seminar  on  issues  involving 
science  and  public  policy. 

Unlike  similar  government  programs,  this  fellowship  is 
aimed  primarily  at  scientists  outside  government.  It  em- 
phasizes policy-oriented  public  service  rather  than  obser- 
vational learning  and  designates  its  fellows  as  free  agents 
rather  than  representatives  of  their  sponsoring  societies. 

One  of  the  goals  of  DOE  and  ASHG  is  to  develop  a  group 
of  nongovernmental  professionals  who  will  be  equipped  to 
deal  with  issues  concerning  human  genetics  policy  devel- 
opment and  implementation,  particularly  in  the  current 
environment  of  health-care  reform  and  managed  care. 
Graduates  of  this  program  will  serve  as  a  resource  for  consultation
in  the  development  of  public-health  policy  concerning
genetic  disease.

Fellowship  candidates  must  demonstrate  exceptional  basic 
understanding  of  and  competence  in  human  genetics;  hold 
an  earned  degree  in  genetics,  biology,  life  sciences,  or  a 
similar  field;  have  a  well-grounded  and  appropriately 
documented  scientific  and  technical  background;  have  a 
broad  professional  background  in  the  practice  of  human 
genetics  as  demonstrated  by  national  or  international  repu- 
tation; be  cognizant  of  related  nonscientific  matters  that 
impact  on  human  genetics;  exhibit  sensitivity  toward  po- 
litical and  social  issues;  have  a  strong  interest  and  some 
experience  in  applying  personal  knowledge  toward  the 
solution  of  social  problems;  be  a  member  of  ASHG;  be
articulate,  literate,  adaptable,  and  interested  in  working  on 
long-range  public  policy  problems;  be  able  to  work  with  a 
variety  of  people  of  diverse  professional  backgrounds;  and 
function  well  during  periods  of  intense  pressure. 

The  first  fellow  is  working  in  the  office  of  Senator 
Wellstone,  Democrat  from  Minnesota,  and  devoting  most
of  his  time  to  studying  and  commenting  on  health-care  and 
science  issues. 

DOE  Grant  No.  DE-FG02-95ER61974. 

A  Hispanic  Educational  Program  for 
Scientific,  Ethical,  Legal,  and  Social 
Aspects  of  the  Human  Genome  Project 

Margaret  C.  Jefferson  and  Mary  Ann  Sesma¹

Department  of  Biology  and  Microbiology;  California  State

University;  Los  Angeles  CA  90032

213/343-2059,  Fax:  -2095,  mjeffer@flytrap.calstatela.edu

http://vflylab.calstatela.edu/hgp

¹Los  Angeles  Unified  School  District

The  primary  objectives  of  this  grant  are  to  develop,  imple- 
ment, and  distribute  culturally  competent,  linguistically 
appropriate,  and  relevant  curriculum  that  leads  to  Hispanic 
student  and  family  interactions  regarding  the  science,  ethi- 
cal, legal,  and  social  issues  of  the  Human  Genome  Project. 
By  opening  up  channels  of  familial  dialogue  between  par- 
ents and  their  high  school  students,  entire  families  can  be 
exposed  to  genetic  health  and  educational  information  and 
opportunities.  In  addition,  greater  interaction  is  anticipated 
between  students  and  teachers,  and  parents  and  teachers. 
In  the  Los  Angeles  Unified  School  District  alone,  over 
65%  of  the  approximately  850,000  student  enrollment  are 
bilingual  Hispanics.  The  1990  census  data  revealed  that 
the  U.S.A.  had  a  total  population  of  248,709,873,  of  which 
22,354,059  were  Hispanics,  and  thus,  there  is  a  need  for 
materials  to  be  disseminated  throughout  the  U.S.A.  that  are 
relevant  and  understandable  to  this  population. 

Student  curriculum  consists  of  BSCS  HGP-ELSI  curricu- 
lum available  in  both  English  and  Spanish;  supplemental 
lesson  plans  developed  and  utilized  by  high  school  teach- 
ers in  predominantly  Hispanic  classrooms  that  will  be 
available  via  the  World  Wide  Web;  student-developed  sur- 
veys that  ascertain  knowledge  and  perceptions  of  genetics 
and  HGP-ELSI  in  Hispanic  and  other  ethnic  communities 
in  the  greater  Los  Angeles  area;  the  University  of  Wash- 
ington High  School  Human  Genome  Program  exercises  on 
DNA  synthesis  and  sequencing;  and  career  ladders  and 
opportunities  in  genetics.  The  supplemental  lesson  plans 
are  focused  on  four  major  units:  the  Cell;  Mendelian  Ge- 
netics and  its  Extensions;  Molecular  Genetics;  and  the  Hu- 
man Genome  Project  and  ELSI.  The  concise  concepts  un- 
derlying each  unit  are  being  utilized  in  two  ways:  (a)  first, 
the  student  activities  emphasize  logical,  problem-solving 
exercises;  tools  or  technologies  applicable  to  that  concept; 
when  and  where  appropriate,  a  focus  on  the  Hispanic 
population;  and  an  understanding  of  the  problems  and 
compassion  for  the  families  associated  with  learning  of 
genetic  diseases;  (b)  second,  the  concepts  serve  as  the
springboard  for  the  topics  that  the  students  include  in  sci- 
ence newsletters  to  their  parents.  In  addition  to  on-campus 
activities,  we  intend  to  arrange  field  trips  and/or  classroom 
demonstrations  of  genetic  and  molecular  biology  techniques 
by  scientists  and  other  experts.  The  speakers  would  also  be 
asked  to  discuss  career  opportunities  and  the  educational 
requirements  needed  to  enter  the  specific  careers  presented. 

The  parent  curriculum  consists  of  two  major  activities. 
First,  the  student-parent  newsletter  is  designed  to  draw  the
parents  into  the  curriculum.  Students  write  newsletters  on 
a  biweekly  basis.  Each  newsletter  relates  to  a  student  cur- 
riculum subunit  and  the  specific  subunit  concepts.  English, 
Spanish,  social  science  as  well  as  biology  and  chemistry 
teachers  assist  the  students  in  its  production.  The  other  major
activity  that  involves  the  parents  is  the  parent  focus
groups.  Parents  from  each  participating  school  are  invited 
to  monthly  focus  groups  at  their  specific  campus.  The  fo- 
cus groups  discuss  issues  related  to  genetics  and  health, 
legal  and  social  issues  as  well  as  science  issues  that  stem 
from  the  student  newsletters.  The  discussions  are  in  both 
English  and  Spanish  with  translators  available.  Links  with 
other  programs  have  been  established. 

DOE  Grant  No.  DE-FG03-94ER61797. 

Implications  of  the  Geneticization  of
Health  Care  for  Primary  Care 
Practitioners 

Mary  B.  Mahowald,  John  Lantos,  Mira  Lessick,  Robert
Moss,  Lainie  Friedman  Ross,  Greg  Sachs,  and  Marion  Verp

Department  of  Obstetrics  and  Gynecology  and  MacLean
Center  for  Clinical  Medical  Ethics;  University  of  Chicago;
Chicago,  IL  60637

312/702-9300,  Fax:  -0840,  mm46@midway.uchicago.edu

http://ccme-mac4.bsd.uchicago.edu/CCMEHomePage.html

"Geneticization"  refers  to  the  process  by  which  advances 
in  genetic  research  are  increasingly  applicable  to  all  areas 
of  health  care.¹  Studies  show  that  primary  caregivers  are
often  deficient  in  their  knowledge  of  genetics  and  genetic
tests,  and  the  ethical,  legal,  and  social  implications  of  this
knowledge.²  Accordingly,  this  project  prepares  primary
caregivers  who  have  no  special  training  in  genetics  or  ge- 
netic counseling  to  deal  with  the  implications  of  the  Hu- 
man Genome  Project  for  their  practice. 

Phase  I  (fall  1995):  Generic  topics  will  be  addressed  by  PI 
and  Co-PIs  with  Robert  Wood  Johnson  clinical  scholars 
and  clinical  ethics  fellows,  led  by  visiting  or  internal  experts.

Topics:  Goals,  Methods,  &  Achievements  of  the  HGP;  Typology
of  Genetic  Conditions;  Scientific,  Clinical,  Ethical,
and  Legal  Aspects  of  Gene  Therapy;  Concepts  of  Disease; 
Genetic  Disabilities;  Gender  and  Socio-economic  Differ- 
ences; Cultural  and  Ethnic  Differences;  Directive  or  Non- 
directive  genetic  counseling. 

Speakers:  Jeff  Leiden;  Julie  Palmer;  Dan  Brock;  Anita  Silvers;
Abby  Lippman;  James  Bowman;  Beth  Fine

Phase  II  (Jan.-Mar.  1996):  Teams  of  individuals,  all
trained  in  the  same  area  of  primary  care,  will  identify  and 
address  issues  specific  to  their  area,  developing  course  out- 
lines, bibliography,  and  methodology  based  on  grand 
rounds  given  by  national  expert. 

Primary  Care  Area

Pediatrics:  Genetics  expert:  Stephen  Friend;  Ethics  expert:  Lainie  F.  Ross  +  fellow
Obstetrics/Gynecology:  Genetics  expert:  Joe  Leigh  Simpson;  Ethics  expert:  Marion  Verp  +  fellow
Medicine:  Genetics  expert:  Tom  Caskey;  Ethics  expert:  Greg  Sachs  +  fellow
Family  medicine:  Genetics  expert:  Noralane  Lindor;  Ethics  expert:  Robert  Moss  +  fellow
Nursing:  Genetics  expert:  Mira  Lessick;  Ethics  expert:  Colleen  Scanlon  +  fellow

Phase  III  (Apr.-May  1996):  Policy  issues  will  be  identi- 
fied and  addressed  as  above  for  all  areas  of  primary  care, 
based  on  grand  rounds  given  by  national  expert. 

Policy  team:  Genetics  expert:  Sherman  Elias;  Ethics
expert:  John  Lantos  +  trainee

Phase  IV  (Oct.-Dec.  1996):  Presentation  of  content  developed
to  new  group  of  fellows  and  scholars  by  each  of  the
above  teams,  followed  by  evaluation  &  revision. 

Phase  V  (spring  1997):  NATIONAL  CONFERENCE  and 
CME/CNE  WORKSHOPS  for  primary  caregivers,  key- 
noted  by  Victor  McKusick. 

DOE  Grant  No.  DE-FG02-95ER61990.

Nontraditional  Inheritance:  Genetics
and  the  Nature  of  Science;  Instructional 
Materials  for  High  School  Biology 

Joseph  D.  McInerney  and  B.  Ellen  Friedman

Biological  Sciences  Curriculum  Study;  Colorado  Springs,
CO  80918

719/531-5550,  Fax:  -9104,  jmcinerney@cc.colorado.edu

There  often  is  a  gap  between  the  public's  and  scientists' 
views  of  new  research  findings,  particularly  if  the  public's 
understanding  of  the  nature  of  science  is  not  sound.  Large 
quantities  of  new  evidence  and  consequent  changes  in  sci- 
entific explanations,  such  as  those  associated  with  the  Hu- 
man Genome  Project  and  related  genetics  research,  can 
accentuate  those  different  views.  Yet  an  appealing  second- 
ary effect  of  the  unusually  fast  acquisition  of  data  is  that 
our  view  of  genetics  is  changing  rapidly  during  a  brief 
time  period,  a  relatively  recent  phenomenon  in  the  field  of 
biological  sciences.  This  situation  provides  an  outstanding 
opportunity  to  communicate  the  nature  and  methods  of 
science  to  teachers  and  students,  and  indirectly  to  the  pub- 
lic at  large.  The  immediacy  of  new  explanations  of  genetic 
mechanisms  lets  nontechnical  audiences  actually  experience
a  changing  view  of  various  aspects  of  genetics,  and  in
so  doing,  gain  an  appreciation  of  the  nature  of  science  that 
rarely  is  felt  outside  of  the  research  laboratory. 

The  Biological  Sciences  Curriculum  Study  (BSCS)  is  de- 
veloping a  curriculum  module  that  brings  this  active  view 
of  the  nature  and  methods  of  science  into  the  classroom 
via  examples  from  recent  discoveries  in  genetics.  We  will 
distribute  this  print  module  free  of  charge  to  interested 
high  school  biology  teachers  in  the  United  States. 

The  examples  selected  for  classroom  activities  include  the 
instability  of  trinucleotide  repeats  as  an  explanation  of  ge- 
netic anticipation  in  Huntington  disease  and  myotonic  dys- 
trophy, and  the  more  widespread  genetic  mechanism  of 
extranuclear  inheritance,  illustrated  by  mitochondrial  in- 
heritance. Background  materials  for  teachers  discuss  a 
wider  range  of  phenomena  that  require  nontraditional 
views  of  inheritance,  including  RNA  editing,  genomic  im- 
printing, transposable  elements,  and  uniparental  disomy. 
The  genetics  topics  in  the  module  share  the  common
characteristic  that  they  are  not  adequately  explained  by  the
traditional  Mendelian  concepts  that  are  taught  in
introductory  biology  at  the  high  school  level.  In  addition  to
updating  the  genetics  curriculum  and  communicating  the  nature
of  science,  the  module  devotes  one  activity  to  the  ethical
and  social  aspects  of  new  genetics  discoveries  by
challenging  students  to  consider  the  current  reluctance  to  test
asymptomatic  minors  for  the  presence  of  the  HD  gene.

The  major  challenge  we  have  faced  in  this  project  is  to 
make  relatively  technical  genetics  information  accessible 
to  high  school  teachers  and  students  and  to  turn  the  often
passive  treatment  of  scientific  processes  into  an  active
experience  that  helps  students  develop  an  understanding  and
appreciation  of  the  nature  and  methods  of  science.  The 
module  is  being  field  tested  in  classrooms  across  the  coun- 
try. Evaluation  data  from  the  field  test  will  guide  final  revi- 
sion of  the  module  prior  to  distribution. 

DOE  Grant  No.  DE-FG03-95ER61989. 


The  Human  Genome  Project:  Biology, 
Computers,  and  Privacy:  Development 
of  Educational  Materials  for  High 
School  Biology 

Joseph  D.  McInerney,  Lynda  B.  Micikas,  and  B.  Ellen
Friedman

Biological  Sciences  Curriculum  Study;  Colorado  Springs,
CO  80918

719/531-5550,  Fax:  -9104,  jmcinerney@cc.colorado.edu

One  of  the  challenges  faced  by  the  Human  Genome 
Project  (HGP)  is  to  handle  effectively  the  enormous  quan- 
tities and  types  of  data  that  emerge  as  a  result  of  progress 
in  the  project.  The  informatics  aspect  of  the  HGP  offers  an 
excellent  example  of  the  interdependence  of  science  and 
technology.  In  addition,  the  electronic  storage  of  genomic
information  raises  important  questions  of  ethics  and  public 
policy,  many  revolving  around  privacy. 

The  Biological  Sciences  Curriculum  Study  (BSCS)  ad- 
dresses the  scientific,  technological,  ethical,  and  policy 
aspects  of  genome  informatics  in  the  instructional  program 
titled  The  Human  Genome  Project:  Biology,  Computers, 
and  Privacy.  The  program,  intended  for  use  in  high  school 
and  college  biology,  consists  of  software  and  a  150-page 
print  module.  The  software  includes  two  model  databases: 
a  research  database  housing  anonymous  data  (map  data, 
sequence  data,  and  biological/clinical  information)  and  a 
registry  that  attaches  names  of  52  fictitious  individuals 
(three  kindreds)  to  genomic  data.  Students  manipulate  the 
database  software  as  they  work  through  seven  classroom 
inquiries  described  in  the  print  material.  Also  included  are
50  pages  of  background  material  for  teachers.

An  introductory  activity  lets  students  become  familiar  with 
the  software  and  dramatically  demonstrates  the  advantages 
of  technology  in  analysis  of  sequence  data.  In  activities  1 
and  2,  students  use  the  database  to  construct  pedigrees  and 
make  initial  choices  about  privacy  with  regard  to  genetic 
tests  for  their  fictitious  person.  Activity  3  expands  genetic 
anticipation,  and  in  activities  4  and  5,  students  deal  in 
depth  with  decision-making,  ethics,  and  public  policy,  re- 
visiting their  earlier  decision  about  testing  and  data  acces- 
sibility. A  final  extension  activity  shows  how  comparisons 
with  genomic  data  can  be  used  to  test  hypotheses  about  the 
biological  relationships  between  individual  humans  and 
about  the  evolutionary  significance  of  DNA  sequence 
similarities  between  different  species. 

External  reviews  and  evaluation  data  from  a  field  test  in- 
volving 1,000  students  in  schools  across  the  United  States 
were  used  to  guide  final  revision  of  the  materials.  BSCS 
will  distribute  the  module  free  of  charge  to  more  than 
10,000  high  school  and  college  biology  teachers. 

DOE  Grant  No.  DE-FG03-93ER61584.

Involvement  of  High  School  Students  in 
Sequencing  the  Human  Genome 

Maureen  M.  Munn,  Maynard  V.  Olson,  and  Leroy  Hood 

Department  of  Molecular  Biotechnology,  University  of
Washington;  Seattle,  WA  98195

206/616-4538,  Fax:  /685-7344,  mmunn@u.washington.edu

For  the  past  two  years,  we  have  been  developing  a  pro- 
gram that  involves  high  school  students  in  the  excitement 
of  genetic  research  by  enabling  them  to  participate  in  se- 
quencing the  human  genome.  This  program  provides  high 
school  teachers  with  the  proper  training,  equipment,  and 
support  to  lead  their  students  through  the  exercise  of  se- 
quencing small  portions  of  DNA.  The  participating  class- 
rooms carry  out  two  experimental  modules,  DNA  synthe- 
sis (an  introduction  to  DNA  replication  and  the  techniques 
used  to  study  it)  and  DNA  sequencing.  Both  of  these
experiments  consist  of  three  parts:  synthesizing  DNA
fragments  using  Sequenase  and  a  biotin-labeled  primer,
benchtop  electrophoresis  using  denaturing  polyacrylamide  gels,
and  colorimetric  DNA  detection  that  is  specific  for  the
biotinylated  primer.  Students  analyze  their  sequencing  data
and  enter  it  into  a  DNA  assembly  program.  This  year,  in 
collaboration  with  Eric  Lynch  and  Mary-Claire  King  from 
the  Department  of  Genetics  at  the  University  of  Washing- 
ton, the  students  will  be  sequencing  a  region  of  chromo- 
some 5q  that  may  be  involved  in  a  form  of  hereditary  deaf- 
ness. 

Students  also  consider  the  ethical,  legal  and  social  issues 
(ELSI)  of  genome  research  in  a  unit  that  explores  the  topic 
of  presymptomatic  testing  for  Huntington's  disease  (HD). 
This  module  was  developed  by  Sharon  Durfy  and  Robert 
Hansen  from  the  Department  of  Medical  History  and  Eth- 
ics at  the  University  of  Washington.  It  provides  a  scenario 
about  a  family  that  carries  the  HD  allele,  descriptions  of 
the  clinical  and  genetic  aspects  of  the  disorder,  an  exercise 
in  drawing  pedigrees  and  an  autoradiograph  showing  the 
PCR  assay  used  to  detect  HD.  Students  use  an  ethical 
decision-making  model  to  decide  whether,  as  a  character 
from  the  scenario,  they  would  be  tested  presymptomati- 
cally  for  the  HD  allele.  Through  this  experience,  they  de- 
velop the  skills  to  define  ethical  issues,  ask  and  research 
the  relevant  questions  about  a  particular  topic  and  make 
justifiable  ethical  decisions.

In  the  first  two  years  of  this  program,  our  focus  was  on  the
development  of  robust,  classroom-friendly  modules  that
can  be  presented  in  up  to  six  classes  at  one  time.  This  year 
we  will  focus  on  disseminating  this  program  to  local,  re- 
gional, and  national  sites.  During  a  week-long  workshop  in 
July,  1995,  we  trained  an  additional  thirteen  high  school 
teachers,  bringing  our  current  number  to  twenty  teachers  at 
thirteen  schools.  We  have  recruited  local  scientists  to  act  as 
mentors  to  each  of  the  schools  and  provide  classroom  sup- 
port. On  the  regional  level,  four  of  our  teachers  are  from 
outside  the  greater  Seattle  area  and  will  be  supported  dur- 
ing the  classroom  experiments  by  scientists  in  their  region. 
We  have  presented  this  program  at  national  meetings  and 
workshops,  including  the  Human  Genome  Teacher  Networking
Project  Workshop  in  Kansas  City,  KS  (June,
1995)  and  the  meeting  of  the  National  Association  of  Biol- 
ogy Teachers  in  Phoenix,  AZ  (October  1995).  We  have 
also  distributed  our  modules  to  teachers  and  scientists 
throughout  the  nation  to  encourage  the  development  of 
similar  programs.  This  year  we  will  also  develop  and  pilot 
a  module  using  automated  sequencing.  This  will  enable 
distant  schools  to  participate  in  the  program  by  providing 
them  with  the  option  of  sending  their  DNA  samples  to  the 
UW  genome  center  for  electrophoresis.

While  we  hope  the  human  genome  sequencing  experience
will  interest  some  students  in  science  careers,  a  broader 
goal  is  to  encourage  high  school  students  to  think  con- 
structively and  creatively  about  the  implications  of  scien- 
tific findings  so  that  the  coming  generation  of  adults  will 
make  judicious  decisions  affecting  public  policies. 

DOE  Grant  No.  DE-FG03-96ER62175. 

The  Gene  Letter:  A  Newsletter  on 
Ethical,  Legal,  and  Social  Issues  in 
Genetics  for  Interested  Professionals 
and  Consumers 

Philip  J.  Reilly,  Dorothy  C.  Wertz,  and  Robin  J.R.  Blatt¹

The  Shriver  Center  for  Mental  Retardation;  Division  of
Social  Science,  Ethics  and  Law;  Waltham,  MA  02254
617/642-0230,  Fax:  l%9^-5}A0,  preilly@shriver.org
¹Also  at  Massachusetts  Department  of  Public  Health,  Boston,  MA
http://www.shriver.org

We  propose  to  develop  a  newsletter  on  ELSI-related  issues 
for  dissemination  to  a  broad  general  audience  of  profes- 
sionals and  consumers.  No  such  focussed  public  newsletter 
currently  exists.  Entitled  The  Gene  Letter,  the  newsletter 
will  be  distributed  monthly  on-line,  through  the  Internet. 
Updated  weekly  on  the  Internet,  it  will  be  poised  to  react 
in  a  timely  fashion  to  new  developments  in  science,  law, 
medicine,  ethics,  and  culture.  The  newsletter  does  not  pro- 
pose to  provide  comprehensive  education  in  genetics  for 
the  American  public,  but  rather  to  begin  an  information 
network  that  interested  people  can  use  for  further  information.
It  will  be  the  most  widely  distributed  newsletter  on
ELSI  genetics  in  the  world,  with  the  largest  consumer 
readership.  Features  will  be  largely  informational  and  will 
include  new  scientific/medical  developments  and  attendant 
ELSI  issues,  new  court  decisions,  legislation,  and  regula- 
tions, balanced  responses  to  new  concerns  in  the  media, 
and  new  developments  related  to  health  that  may  be  of  in- 
terest to  health  care  providers  and  consumers.  Features 
will  present  balanced  opinions.  An  editorial  board  will  re- 
view each  issue,  prior  to  publication,  for  cultural  sensitiv- 
ity, emphasis,  balance,  and  concerns  of  persons  with  dis- 
abilities. The  Gene  Letter  will  also  include  factual  infor- 
mation on  upcoming  events,  new  ELSI  research,  where  to 
find  genetics  on  the  Internet,  new  publications  (annotated),
and  where  to  find  further  information  about  each  feature.
Readers  will  be  invited  to  send  letters,  queries,  news,  bibli- 
ography, comments,  and  consumer  concerns  either  on  The 
Gene  Letter  Internet  chatroom  or  in  hard  copy.  A  hard 
copy  of  the  first  on-line  issue  will  be  used  to  assess
readers'  needs  and  interests.  It  will  be  distributed  to  500
community  college  students  representing  blue-collar  ethnic
groups,  and  to  2000  members  of  a  broad  general  audience. 

A  special  evaluation  of  readers'  knowledge  and  ethical/ 
social  concerns  raised  by  The  Gene  Letter  will  take  place 
at  the  end  of  the  second  year  in  order  to  assess  outcome.  It 
is  our  intention  that  The  Gene  Letter  become  self-supporting
after  two  years.

DOE  Grant  No.  DE-FG02-96ER62174.


The  DNA  Files:  A  Nationally 
Syndicated  Series  of  Radio  Programs 
on  the  Social  Implications  of  Human 
Genome  Research  and  Its  Applications 

Bari  Scott,  Matt  Binder,  and  Jude  Thilman 

Genome  Radio  Project;  KPFA-FM;  Berkeley,  CA  94704 

510/848-6767  ext  235,  Fax:  /883-0311,  strp@aol.com 

The  DNA  Files  is  a  series  of  nationally  distributed  public
radio  programs  furthering  public  education  on  develop- 
ments in  genetic  science.  Program  content  is  guided  by  a 
distinguished  body  of  advisors  and  will  include  the  voices 
of  prominent  genetic  researchers,  people  affected  by  ad- 
vances in  the  clinical  application  of  genetic  medicine, 
members  of  the  biotech  industry,  and  others  from  related 
fields.  They  will  provide  real-life  examples  of  the  complex 
social  and  ethical  issues  associated  with  new  discoveries  in 
genetics.  In  addition  to  the  general  public  radio  audience, 
the  series  will  target  educators,  scientists,  and  involved 
professionals.  Ancillary  educational  materials  will  be
distributed  in  paper  and  digital  form  through  over  two  dozen
collaborative  organizations  and  fulfillment  of  listener
requests.

"DNA  and  Behavior:  Is  Our  Fate  Written  in  Our  Genes?"
is  the  pilot  documentary  for  the  series,  scheduled  for  re- 
lease in  early  1996.  The  show  will  help  the  lay  person  un- 
derstand and  evaluate  recent  research  in  the  area  of  behav- 
ioral genetics.  Recently,  we've  seen  news  media  reports  on 
newly  discovered  genetic  factors  being  related  to  behav- 
iors such  as  alcoholism,  mental  illness,  sexual  orientation 
and  aggression.  This  program  will  look  at  several  ex- 
amples of  these  "genetic  factors"  and  evaluate  the 
strengths  and  weaknesses  of  various  methodologies  involved
in  the  research,  and  introduce  such  controversial
issues  as  the  re-emergence  of  a  eugenics  movement  based 
on  theoretical  suppositions  drawn  from  recent  work  in  be- 
havioral genetics. 

With  information  linking  major  diseases  such  as  breast 
cancer,  colon  cancer,  and  arteriosclerosis  to  genetic  fac- 
tors, new  dangers  in  public  perception  emerge.  Many 
people  who  hear  about  them  mistakenly  conclude  that 
these  diseases  can  now  be  easily  diagnosed  and  even 
cured.  On  the  other  end  of  the  public  perception  spectrum, 
unfounded  fears  of  extreme,  and  highly  unlikely,  conse- 
quences also  appear.  Will  society  now  genetically  engineer 
whole  generations  of  people  with  "designer  genes"  offer- 
ing more  "desirable  physical  qualities"?  The  DNA  Files 
will  ground  public  understanding  of  these  issues  in  reality.
"DNA  and  the  Law"  reviews  the  scientific  basis  for  genetic
fingerprinting  and  looks  at  cases  of  alleged  genetic
discrimination  by  insurance  companies,  employers  and 
others.  This  program  also  looks  at  disputes  over  paternity, 
intellectual  property  rights,  the  commercialization  of  ge- 
netic information,  informed  consent  and  privacy  issues. 
Other  shows  include  "The  Search  for  a  Breast  Cancer 
Gene,"  "Prenatal  Genetic  Testing  and  Treatment,"  "Evolu- 
tion and  Genetic  Diversity,"  "Sickle-Cell  Disease  and 
Thalassemia:  Hope  for  a  Cure,"  and  "Theology,  Mythol- 
ogy and  Human  Genetic  Research." 

DOE  Grant  No.  DE-FG03-95ER62003.


Communicating  Science  in  Plain 
Language:  The  Science  +  Literacy  for
Health:  Human  Genome  Project 

Maria  Sosa,  Judy  Kass,  and  Tracy  Gath 

American  Association  for  the  Advancement  of  Science;
Washington,  DC  20005

202/326-6453,  Fax:  /371-9849,  msosa@aaas.org

Recent  literacy  surveys  have  found  that  a  large  number  of 
adults  lack  the  skills  to  bring  meaning  to  much  of  what  is 
written  about  science.  This,  in  effect,  denies  them  access  to 
vital  information  about  their  health  and  well-being.  To  ad- 
dress this  need,  the  American  Association  for  the  Advance- 
ment of  Science  (AAAS)  is  developing  a  2-year  project  to 
provide  low-literate  adults  with  the  background  knowledge 
necessary  to  address  the  social,  ethical,  and  legal  implica- 
tions of  the  Human  Genome  Project. 

With  its  Science  +  Literacy  for  Health:  Human  Genome
Project,  AAAS  is  using  its  existing  network  of  adult  edu- 
cation providers  and  volunteer  science  and  health  profes- 
sionals to  pursue  the  following  overall  objectives:  (1)  to 
develop  new  materials  for  adult  literacy  classes,  including 
a  high-interest  reading  book  and  accompanying  curricu- 
lum, an  implementation  framework,  a  short  video  provid- 
ing background  information  on  genetics,  a  database  of  re- 
sources, and  fact  sheets  that  will  assist  other  organizations 
and  researchers  in  preparing  easy-to-read  materials  about 
the  human  genome  project,  and  (2)  to  develop  and  conduct 
a  campaign  to  disseminate  project  materials  to  libraries 
and  community  organizations  carrying  out  literacy  pro- 
grams throughout  the  United  States. 

Because  not  every  low-literate  adult  is  enrolled  in  a  lit- 
eracy class,  our  model  for  helping  scientists  communicate 
in  simple  language  will  have  impact  beyond  classrooms 
and  learning  centers.  In  preliminary  contacts,  community
groups  providing  health  services  have  indicated  that  the 
proposed  materials  are  not  only  desirable  but  needed;  in- 
deed such  groups  often  receive  requests  for  information  on 
heredity  and  genetics.  The  module  developed  by  AAAS 
should  enable  other  medical  and  scientific  organizations  to 
communicate  more  effectively  with  economically  disad- 
vantaged populations,  which  often  include  a  large  number 
of  low-literate  individuals. 

DOE  Grant  No.  DE-FG02-95ER61988.


The  Community  College  Initiative 

Sylvia  J.  Spengler  and  Laurel  Egenberger 

Lawrence  Berkeley  National  Laboratory;  Berkeley,  CA  94720

510/486-4879,  Fax:  -5717,  sjspengler@lbl.gov

http://csee.lbl.gov/cup/ccibiotech/Index.html

The  Community  College  Initiative  prepares  community 
college  students  for  work  in  biotechnology.  A  combined 
effort  of  Lawrence  Berkeley  National  Laboratory  (LBNL) 
and  the  California  Community  Colleges,  we  aim  to  de- 
velop mechanisms  to  encourage  students  to  pursue  science 
studies,  to  participate  in  forefront  laboratory  research,  and 
to  gain  work  experience.  The  initiative  is  structured  to  up- 
grade the  skills  of  students  and  their  instructors  through 
four  components. 

Summer  Student  Workshops:  Four-week  summer  residential
programs  for  students  who  have  completed  the  first
year  of  the  biotechnology  academic  program.  Ethical,  legal,
and  social  concerns  are  integrated  into  the  laboratory
exercises  and  students  learn  to  identify  commonly  shared
values  of  the  scientific  community  as  well  as  increase  their
understanding  of  issues  of  personal  and  public  concern. 

Teacher  Workshop  Training:  Seminars  for  biotechnology 
instructors  to  improve,  upgrade,  and  update  their  under- 
standing of  current  technology  and  laboratory  practices, 
with  emphasis  on  curriculum  development  in  current  top- 
ics in  ethical,  legal,  and  social  issues  in  science. 

Sabbatical  Fellowships:  For  community  college  instruc- 
tors to  provide  investigative  and  field  experience  in  re- 
search laboratories.  During  the  fellowship,  teachers  also 
assist  in  development  of  student  summer  research  activi- 
ties. 

Summer  Faculty-Student  Teams:  Post-fellowship  faculty 
and  biotechnology  students  who  have  finished  their  second 
year  of  study  team  on  a  research  project. 

Genome  Educators

Sylvia  Spengler  and  Janice  Mann 

Human  Genome  Program;  Life  Sciences  Division;
Lawrence  Berkeley  National  Laboratory;  Berkeley,  CA  94720

510/486-4879,  Fax:  -5717,  sjspengler@lbl.gov  or  jlmann@lbl.gov

http://www.lbl.gov/Education/Genome

Genome  Educators  is  an  informal  network  of  educational 
professionals  who  have  an  active  interest  in  all  aspects  of 
genetics  research  and  education.  This  national  group  in- 
cludes scientists,  researchers,  educational  curriculum  de- 
velopers, ethicists,  health  professionals,  high  school  teach- 
ers and  instructors  at  college  and  graduate  levels,  and  oth- 
ers in  occupations  affected  by  genetic  research. 

Genome  Educators  is  a  unique  collaborative  effort  dedi- 
cated to  sharing  information  and  resources  to  further  un- 
derstanding of  current  advances  in  the  field  of  genetics. 
Seminars,  workshops,  and  special  events  are  sponsored  at 
frequent  intervals.  Genome  Educators  maintains  an  active 
World  Wide  Web  site  (URL:  http://www.lbl.gov/Education/Genome).
This  site  contains  a  calendar  of  events,  directory
of  participating  genome  educators,  and  information
about  educational  resources  and  reference  tools.  Participat- 
ing genome  educators  may  publish  articles  and  talks  of 
interest  at  this  site.  In  addition,  a  monitored  discussion 
group  is  maintained  to  facilitate  dialog  and  resource  shar- 
ing among  participants. 


Getting  the  Word  Out  on  the  Human 
Genome  Project:  A  Course  for 
Physicians 

Sara  L.  Tobin  and  Ann  Boughton¹

Department  of  Biochemistry  and  Molecular  Biology;
Center  for  Biomedical  Ethics;  Stanford  University;  Palo
Alto,  CA  94304-1709

415/725-2663,  Fax:  -6131,  tobinsl@leland.stanford.edu
¹Thumbnail  Graphics;  Oklahoma  City,  OK  73118

Progressive  identification  of  new  genes  and  implications 
for  medical  treatment  of  genetic  diseases  appear  almost 
daily  in  the  scientific  and  medical  literature,  as  well  as  in 
public  media  reports.  However,  most  individuals  do  not 
understand  the  power  or  the  promise  of  the  current  explo- 
sion in  knowledge  of  the  human  genome.  This  is  also  true 
of  physicians,  most  of  whom  completed  their  medical 
training  prior  to  the  application  of  recombinant  DNA  tech- 
nology to  medical  diagnosis  and  treatment.  This  lack  of 
training  prevents  physicians  from  appreciating  many  of  the 
recent  advances  in  molecular  genetics  and  may  delay  their 
acceptance  of  new  treatment  regimens.  In  particular,  physi- 
cians practicing  in  rural  communities  are  often  limited  in 
their  access  to  resources  that  would  bring  them  into  the 
mainstream  of  current  molecular  developments.  This 
project  is  designed  to  fill  two  important  functions:  first,  to
provide  solid  training  for  physicians  in  the  field  of  molecu- 
lar medical  genetics,  including  the  impact,  implications, 
and  potential  of  this  field  for  the  treatment  of  human  dis- 
ease; second,  to  utilize  physicians  as  informed  community 
resources  who  can  educate  both  their  patients  and  commu- 
nity groups  about  the  new  genetics. 

We  propose  to  develop  a  flexible,  user-friendly,  interactive 
multimedia  CD-ROM  designed  for  continuing  education 
of  physicians  in  applications  of  molecular  medical  genet- 
ics. To  initiate  these  objectives,  we  will  develop  the  design 
of  the  CD  and  will  produce  a  prototype  providing  a  de- 
tailed presentation  of  one  of  the  four  training  areas.  These 
areas  are  (1)  Genetics,  including  DNA  as  a  molecular  blueprint,
chromosomes  as  vehicles  for  genetic  information,
and  patterns  of  inheritance;  (2)  Recombinant  techniques, 
stressing  cloning  and  analytical  tools  and  techniques  ap- 
plied to  medical  case  studies;  (3)  Current  and  future  clini- 
cal applications,  encompassing  the  human  genome  project, 
technical  advances,  and  disease  diagnosis  and  prognosis; 
and  (4)  Societal  implications,  focusing  on  approaches  to 
patient  counseling,  genetic  dilemmas  faced  by  patients  and 
practitioners,  and  societal  values  and  development  of  an 
ethical  consensus.  Area  (2)  will  be  presented  in  the  proto- 
type. 

The  CD  format  will  permit  the  use  of  animation,  video, 
and  audio,  in  addition  to  graphic  illustrations  and  photo- 
graphs. We  will  build  on  our  existing  base  of  computer 
generated  illustrations.  A  hypertext  glossary,  user  notes,
practice  tests,  and  customized  settings  will  be  utilized  to
tailor  the  CD  to  the  needs  of  the  user.  Brief, 
multiple-choice  examinations  will  be  evaluated  for  con- 
tinuing medical  education  credits  by  the  Office  of  Continu- 
ing Medical  Education.  The  CD  will  be  programmed  to 
permit  updates  of  scientific  and  medical  advances  either 
by  downloading  from  the  Internet  or  from  a  disc  available 
by  subscription. 

This  is  a  cooperative  project  involving  individuals  with 
documented  expertise  in  teaching  of  molecular  medical 
genetics,  continuing  medical  education,  graphic  design, 
and  CD-ROM  production.  The  content  of  the  CD  will  be 
supervised  by  a  scientific  board  of  directors.  We  present 
mechanisms  for  the  evaluation  of  the  CD  by  rural  Okla- 
homa physicians.  Arrangements  have  been  made  for  distri- 
bution of  the  CD  by  a  national  publisher  of  medical  and 
scientific  materials.  This  CD  will  provide  a  powerful  tool 
to  educate  physicians  and  the  public  about  the  power  and 
potential  of  the  human  genome  project  for  the  benefit  of 
human  health. 

DOE  Grant  No.  DE-FG03-96ER62172. 

The  Genetics  Adjudication  Resource 
Project 

Franklin  M.  Zweig 

Einstein  Institute  for  Science,  Health,  and  the  Courts;
Bethesda,  MD  20814
301/961-1949,  Fax:  /913-0448,  einshac@aol.com
http://www.ornl.gov/courts

The  Einstein  Institute  for  Science,  Health,  and  the  Courts 
is  preparing  the  foundation  for  a  new  utility  needed  to  pre- 
pare the  nation's  21,000  courts  to  adjudicate  the  genetics 
and  ELSI-related  issues  that  foreseeably  will  rush  into  the 
courtroom  as  the  Human  Genome  Project  completes  its 
genomic  mapping  and  sequencing  mission  during  the  next 
ten  years.  This  project  initiates  practical  collaboration 
among  courts,  legal  and  policy-making  institutions,  and 
science  centers  leading  to  modalities  for  understanding  the 
scientific  validity  of  claims,  and  for  the  resolution  of  ethical,
legal,  and  social  disputes  arising  within  the  genetic
testing  and  gene  therapy  contexts.  Our  objective  over  the 
ensuing  decade  is  to  facilitate  genetic  testing  and  gene 
therapy  dispute  management,  and  to  avoid  to  the  extent 
possible  the  confusion  that  characterized  adjudication  of 
forensic  DNA  technologies  during  the  decade  just  ended. 

The  outlines  of  a  genetics  adjudication  utility  were  given 
form  by  the  1995  Working  Conversation  on  Genetics,  Evo- 
lution, and  the  Courts,  involving  37  federal  and  state 
judges  and  others  in  science  and  policymaking  leadership 
positions  from  across  the  nation.  The  courts  are  becoming 
aware  of  genetics,  molecular  biology,  and  their  applica- 
tions, and  judges  want  public  confidence  to  be  maintained 
as  the  profound  and  complex  issues  set  in  motion  by  the
HGP  begin  the  long  course  of  litigation.  Modalities  for
understanding  the  underpinning  science  are  needed,  as
well  as  instrumentalities  to  assure  that  the  best  cases  are
actually  filed  and  pursued.  Because  the  courts  are  the
front-line  for  resolving  disputes,  creative  lawyering  will
assure  an  abundance  of  lawsuits.  Many  such  lawsuits  will
request  the  courts  to  make  policy  judgments,  perhaps  best
undertaken  by  state  legislatures  and  Congress.  Accord- 
ingly, a  new  adjudication  utility  should  provide  forums  for 
judicial/legislative  exchange,  preparatory  deliberations  in 
anticipation  of  pressure  to  make  rushed  policies  under  con- 
ditions of  great  social  uncertainty  in  the  wake  of  human 
genetics  progress. 

EINSHAC  will  provide  a  design,  planning,  communica- 
tions, and  implementation  center  for  a  multipurpose  re- 
source project  available  to  the  courts.  It  will  undertake 
over  an  18  month  period  the  following  tasks,  pilot-testing 
each  and  assessing  the  best  organizational  locales  for  those 
that  exhibit  promise: 

1.  Judicial  Education  in  Genetics  &  ELSI-Related  Issues 
for  six  Judicial  Branch  leadership  associations  and  nine 
metropolitan  courts — aimed  at  1,000  judges — in  conjunction
with  scientific  faculty  and  coaches  mobilized  by
DOE/national  laboratories  and  the  American  Society  for 
Human  Genetics. 

2.  Judicial  Digital  Electronic  Collegium — technological 
modernization  of  the  courts  community  by  providing  ac- 
cess to  ELSI  and  genetics  information  through  Internet 
resources. 

3.  Amicus  Brief  Development  Trust  Fund — a  process  and 
resources  to  support  law  development  at  the  state  and  fed- 
eral appeals  courts  level. 

4.  Genetics  Indigent  Party  Trust  Fund — a  process  and  resources
at  the  state  and  federal  trial  level  to  sustain  meritorious
civil  cases  holding  promise  of  effective  law  development.

5.  Establishment  of  a  Pro-Bono  Legal  Services  Clearinghouse — a
personal  and  on-line  referral  resource  for  persons
seeking  representation  for  genetics  and  ELSI-related  cases.

6.  Access  to  Neutral  Expert  Witnesses — advisors  to  courts 
encountering  particularly  complex  cases  deemed  right  for 
the  judicial  exercise  of  Federal  Rule  of  Evidence  706  and 
its  State  counterparts. 

7.  Pilot  of  Judicial/Legislative  ELSI  Policy  Forums — pro- 
vision of  neutral  staff  and  coordination  in  three 
mid-Atlantic  states  considering  legislation  related  to  health 
care,  insurance,  privacy,  medical  records. 

8.  National  Training  Center  for  Minority  Justice  Personnel — facilitating
a  leadership  preparation  program  for  the  nation's  minority
court-related  personnel  in  a  consortium  arrangement  with  the
Ruffin  Society  of  Massachusetts,  the  College  of  Criminal  Justice
at  Northeastern  University,  and  the  Flaschner  Judicial  Institute.

The  Project  actively  involves  judges,  scientists,  and  prominent
lawyers.  It  will  report  to  the  EINSHAC  Board  of  Directors  that
includes  prominent  judges,  justices  and  scientists,  several  of
whom  participated  in  the  1995  Working  Conversation  on  Genetics,
Evolution  and  the  Courts.  As  a  continuing  guidance  forum,
EINSHAC  will  conduct  a  Working  Conversation  followup  in
Orleans,  Cape  Cod  in  July,  1996.

DOE  Grant  No.  DE-FG02-96ER62081.

Alexander  Hollaender  Distinguished
Postdoctoral  Fellowships

Linda  Holmes  and  Eugene  Spejewski 

Oak  Ridge  Institute  for  Science  and  Education;  Oak  Ridge,
TN  37831-0117
423/576-3192,  Fax:  /241-5220,  holmesl@orau.gov  or
alexpgm@orau.gov
http://www.orau.gov/oher/hollaend.htm

The  Alexander  Hollaender  Distinguished  Postdoctoral  Fellowships,
sponsored  by  the  Department  of  Energy  (DOE),
Office  of  Health  and  Environmental  Research  (OHER),
support  research  in  the  fields  of  life,  biomedical,  and  environmental
sciences.  Since  the  DOE  Human  Genome  Distinguished
Postdoctoral  Fellowships  and  DOE  Global
Change  Distinguished  Postdoctoral  Fellowships  both  had
their  last  application  cycles  in  FY  1995,  the  Hollaender
program  is  now  open  to  recent  PhD  graduates  in  the  fields
of  human  genome  and  global  change,  as  well.

Fellowships  of  up  to  2  years  are  tenable  at  any  DOE,  uni- 
versity, or  private  laboratory  providing  the  proposed  ad- 
viser at  that  laboratory  receives  at  least  $150,000  per  year 
in  support  from  OHER.  Fellows  earn  stipends  of  $37,500 
the  first  year  and  $40,500  the  second.  To  be  eligible,  appli- 
cants must  be  U.S.  citizens  or  permanent  residents  at  the 
time  of  application,  and  must  have  received  their  doctoral 
degrees  within  two  years  of  the  earliest  possible  starting 
date,  which  is  May  1  of  the  appointment  year.

The  Oak  Ridge  Institute  for  Science  and  Education 
(ORISE),  administrator  of  the  fellowships,  prepares  and 
distributes  program  literature  to  universities  and  laborato- 
ries across  the  country,  accepts  applications,  convenes  a 
panel  to  make  award  recommendations,  and  issues  stipend 
checks  to  fellows.  The  review  panel  identifies  finalists 
from  which  DOE  selects  the  award  winners.  Deadline  for 
the  FY  1999  fellowship  cycle  is  January  15,  1998.  For 
more  information  or  an  application  packet,  contact  Linda 
Holmes  at  the  Oak  Ridge  Institute  for  Science  and  Education,
P.O.  Box  117,  Oak  Ridge,  TN  37831-0117  (423/576-9975,
Fax:  /241-5220).

DOE  Contract  No.  DE-AC05-76OR00033.


Human  Genome  Management 
Information  System 

Betty  K.  Mansfield,  Anne  E.  Adamson,  Denise  K.  Casey,
Sheryl  A.  Martin,  John  S.  Wassom,  Judy  M.  Wyrick,
Laura  N.  Yust,  Murray  Browne,  and  Marissa  D.  Mills
Life  Sciences  Division;  Oak  Ridge  National  Laboratory;
Oak  Ridge,  TN  37830
423/576-6669,  Fax:  /574-9888,  bkq@ornl.gov
http://www.ornl.gov/hgmis


The  Human  Genome  Management  Information  System 
(HGMIS),  established  in  1989,  provides  information  about 
the  international  Human  Genome  Project  in  print  and 
World  Wide  Web  formats  to  both  technical  and  general 
audiences.  HGMIS  is  sponsored  by  the  Human  Genome
Program  Task  Group  of  the  DOE  Office  of  Biological  and 
Environmental  Research  to  help  fulfill  DOE's  commitment 
to  informing  scientists,  policymakers,  and  the  public  about 
the  program's  funded  research  and  the  context  in  which  the 
research  is  conducted.  Several  HGMIS  products,  including 
the  Web  sites  and  newsletter,  have  won  technical  and  elec- 
tronic communication  awards. 

HGMIS  goals  center  on  facilitating  research  at  the  inter- 
face of  genomics  and  other  biological  disciplines  that  seek 
revolutionary  solutions  to  biological,  environmental,  and 
biomedical  challenges.  By  communicating  information 
about  the  Human  Genome  Project  and  its  impact,  HGMIS 
increases  the  use  of  project-generated  resources,  reduces 
duplicative  research  efforts,  and  fosters  collaborations  and 
contributions  to  biology  from  other  research  disciplines. 

Furthermore,  communicating  scientific  and  societal  issues 
to  nonscientist  audiences  contributes  to  increased  science 
literacy,  thus  laying  a  foundation  for  more  informed  deci- 
sion making  and  public-policy  development.  For  example, 
since  1995  HGMIS  has  been  participating  in  a  project  to 
educate  the  judiciary  about  the  basics  of  genetics  and  gene 
testing.  The  aim  is  to  prepare  judges  for  the  flood  of  cases 
involving  genetic  evidence  that  soon  will  enter  the  nation's 
courtrooms. 

Information  Resources 

In  keeping  with  its  goals,  HGMIS  produces  the  following 
information  resources  in  print  and  on  the  Web: 

Human  Genome  News  (HGN).  A  quarterly  forum  for  in- 
terdisciplinary information  exchange,  HGN  uniquely  pre- 
sents a  broad  spectrum  of  topics  related  to  the  Human  Ge- 
nome Project  in  a  single  publication.  Articles  feature  topics 
that  include  project  goals,  progress,  and  direction;  available
resources;  applications  of  project  data  and  resources
to  provide  a  better  understanding  of  biological  processes;
related  or  spinoff  programs;  medical  uses  of  genome  data;
ethical,  legal,  and  social  considerations;  legislative  updates;
other  publications;  meeting  calendars;  and  funding
information.  Most  HGN  articles  also  contain  sources  of
additional  information.  In  May  1997,  DOE  acknowledged
the  newsletter's  value  by  presenting  an  exceptional  service
award  to  HGN's  managing  editor  at  a  symposium  celebrating
50  years  of  biological  and  environmental  research.

Among  14,000  domestic  and  foreign  HGN  subscribers  are 
genome  and  basic  researchers  at  universities,  national 
laboratories,  nonprofit  organizations,  and  industrial  facili- 
ties; educators;  industry  representatives;  legal  personnel; 
ethicists;  students;  genetic  counselors;  medical  professionals;
science  writers;  and  other  interested  individuals.  All  41
issues  of  HGN,  indexed  and  searchable,  are  accessible  via
the  HGMIS  Web  site. 

Other  Publications.  HGMIS  also  produces  the  DOE 
Primer  on  Molecular  Genetics,  progress  reports  on  the 
DOE  Human  Genome  Program,  Santa  Fe  contractor- 
grantee  workshop  proceedings,  1-page  topical  handouts, 
and  other  related  resource  documents.  Expanded  and  re- 
vised by  HGMIS  from  an  earlier  DOE  document,  the  DOE 
Primer  on  Molecular  Genetics  continues  to  be  in  demand. 
It  is  used  as  a  handout  for  genome  centers;  a  resource  for 
new  staff  training  by  companies  that  make  products  for 
genome  scientists;  and  an  educational  tool  for  teachers, 
genetic  counselors,  and  such  organizations  as  high  schools, 
universities,  and  medical  schools  for  student  and 
continuing-education  curricula.  More  than  35,000  hard 
copies  have  been  distributed.  The  primer  also  is  available 
in  several  formats  at  the  HGMIS  Web  site,  including  an 
Adobe  Acrobat  version  that  can  be  used  to  print  "originals"
from  users'  printers.

Distribution  of  Documents.  HGMIS  has  distributed  more 
than  65,000  copies  of  items  requested  by  subscribers, 
meeting  attendees,  and  managers  of  genetics  meetings  and 
educational  events.  These  items  include  HGN,  program 
and  workshop  reports,  DOE-NIH  5-year  plans,  DOE
Primer  on  Molecular  Genetics,  and  To  Know  Ourselves. 
On  request,  HGMIS  supplies  multiple  copies  of  publica- 
tions for  meetings  and  educational  purposes. 

Electronic  Communications.  In  November  1994,  HGMIS 
began  producing  a  comprehensive,  text-based  Web  server 
called  Human  Genome  Project  Information,  which  is  de- 
voted to  topics  relating  to  the  science  and  societal  issues 
surrounding  the  genome  project.  In  July  1997,  this  site  was 
divided  to  better  serve  the  two  diverse  audience  categories 
that  represent  the  majority  of  users:  scientists  and  the  pub- 
lic. The  sites  contain  more  than  1700  text  files  that  are  ac- 
cessed over  1.2  million  times  each  year.  Each  month, 
about  10,000  host  computers  connect  to  the  HGMIS  sites 
directly  and  through  more  than  1000  other  Web  sites.  In 
addition,  HGMIS  links  to  the  National  Institutes  of  Health
and  international  Human  Genome  Organisation  sites,  as 
well  as  to  sites  dedicated  to  education  and  to  the  ethical, 
legal,  and  social  implications  of  the  Human  Genome 
Project. 

All  HGMIS  publications  are  published  on  the  Web  site, 
along  with  such  DOE-sponsored  documents  as  Your 
Genes,  Your  Choices;  the  Genetic  Privacy  Act;  and  histori- 
cal and  other  documents  pertaining  to  the  Human  Genome 
Project.  HGMIS  collaborates  with  the  Einstein  Institute  for 
Science,  Health,  and  the  Courts  to  produce  CASOLM,  the 
online  magazine  for  judicial  education  in  genetics  and  biomedical
issues.  HGMIS  also  maintains  the  Genetics  section
of  the  Virtual  Library  from  CERN  (Switzerland)  and
the  DOE  Human  Genome  Program  pages  and  moderates
the  BioSci  Human  Genome  Newsgroup.

Information  Source 

HGMIS  answers  individual  questions  and  supplies  general 
information  about  the  Human  Genome  Project  by  telephone,
fax,  and  e-mail  and,  as  appropriate,  links  scientists
with  questions  to  appropriate  Human  Genome  Project  con- 
tacts. HGMIS  staff  exchange  ideas  and  suggestions  with 
investigators,  industry  representatives,  and  others  when 
attending  occasional  scientific  conferences  and 
genome-related  meetings  and  displaying  the  DOE  Human 
Genome  Project  traveling  exhibit.  HGMIS  staff  also  make 
presentations  on  the  Human  Genome  Project  to  educa- 
tional, judicial,  and  other  groups. 

HGMIS  resources  serve  as  a  primary  source  for  the  popu- 
lar media  and  for  discipline-specific  publications  that 
broaden  the  distribution  of  genome  project  information  by 
extracting  and  reprinting  from  HGMIS  resources  and  by
linking  to  various  parts  of  the  HGMIS  Web  site. 

HGMIS  continuously  monitors  changes  in  the  direction  of 
the  International  Human  Genome  Project  and  searches  for
ways  to  strengthen  the  content  relevancy  of  the  newsletter, 
the  Web  site,  and  other  services. 

DOE  Contract  No.  DE-AC05-96OR22464. 


Human  Genome  Program 
Coordination 

Sylvia  J.  Spengler 

Lawrence  Berkeley  National  Laboratory;  Berkeley,  CA  94720
510/486-4879,  Fax:  -5717,  sjspengler@lbl.gov
http://www.lbl.gov/Education/ELSI

The  DOE  Human  Genome  Program  of  the  Office  of 
Health  and  Environmental  Research  (OHER)  has  devel- 
oped a  number  of  tools  for  management  of  the  Program. 
Among  these  was  the  Human  Genome  Coordinating  Committee
(HGCC),  established  in  1988.  In  1996,  the  HGCC
was  expanded  to  a  broader  vision  of  the  role  of  genomic 
technologies  in  OHER  programs,  and  the  name  was 
changed  to  reflect  this  broadening.  The  HGCC  is  now  the 
Biotechnology  Forum.  The  Forum  is  chaired  by  the  Asso- 
ciate Director,  OHER.  Members  of  the  Human  Genome 
Program  Management  Task  group  are  ex  officio  members, 
as  are  members  of  the  Health  and  Environmental  Research 
Advisory  Committee's  subcommittee  on  the  Human  Genome.
Responsibilities  of  the  Forum  include  assisting
OHER  in  overall  coordination  of  DOE-funded  genome
research;  facilitating  the  development  and  dissemination  of
novel  genome  technologies;  recommending  establishment
of  ad  hoc  task  groups  in  specific  areas,  such  as  informatics,
technologies,  and  model  organisms;  and  evaluating  progress
and  considering  long-term  goals.  Members  also  serve
on  the  Joint  DOE-NIH  Subcommittee  on  the  Human  Genome
for  interagency  coordination.  The  coordination
group  also  participates  in  interface  programs  with  other 
facilities  and  provides  scientific  support  for  development 
of  other  OHER  goals,  as  requested. 

Support  of  Human  Genome  Program 
Proposal  Reviews 

Walter  Williams 

Education/Training  Division;  Oak  Ridge  Institute  for 
Science  and  Education;  Oak  Ridge,  TN  37831-0117
423/576-4811,  Fax:  /241-2727,  williamw@orau.gov

The  Oak  Ridge  Institute  for  Science  and  Education 
(ORISE),  operated  by  Oak  Ridge  Associated  Universities, 
provides  assistance  to  the  DOE  Office  of  Health  and  Envi- 
ronmental Research  in  the  technical  review  of  proposals 
submitted  in  response  to  solicitations  by  the  DOE  Human 
Genome  Program.  ORISE  staff  members  create  and  maintain
a  database  of  all  proposal  information,  including  abstracts,
relevant  names  and  addresses,  and  budget  data.
This  information  is  compiled  and  presented  to  proposal 
reviewers.  Before  review  meetings,  ORISE  staff  members 
make  appropriate  hotel  and  meeting  arrangements,  provide 
each  reviewer  with  proposal  copies  and  evaluation  guidelines,
and  coordinate  reviewer  travel  and  honoraria  payment.
Onsite  meeting  support  includes  collecting  all  reviewer
evaluation  forms  and  scores,  entering  reviewer
scores  into  the  database,  preparing  appropriate  reports, 
providing  onsite  computer  support  and  handling  all  logis- 
tical issues.  Other  support  includes  assistance  with  pro- 
gram advertising  and  preparation  of  reviewer  comments 
following  each  review.  ORISE  may  also  assist  with  pre- 
and  post-review  activities  related  to  conferences,  seminars, 
and  site  visits. 

DOE  Contract  No.  DE-AC05-76OR00033. 


Infrastructure 

Former  Soviet  Union  Office  of  Health 
and  Environmental  Research  Program 

James  Wright 

Education/Training  Division;  Oak  Ridge  Institute  for 
Science  and  Education;  Oak  Ridge,  TN  37831-0117
423/576-1716,  Fax:  /241-2727,  wrightj@orau.gov 

The  Former  Soviet  Union  Office  of  Health  and  Environ- 
mental Research  Program,  sponsored  by  the  U.S.  Depart- 
ment of  Energy,  Office  of  Health  and  Environmental  Re- 
search, recognizes  outstanding  scientists  in  the  field  of 
health  and  environmental  research  from  the  independent 
states  of  the  former  Soviet  Union.  The  program  fosters  the 
international  exchange  of  new  ideas  and  innovative  ap- 
proaches in  health  and  environmental  research;  strengthens 
ties  and  encourages  continuing  collaboration  among  Russian
and  U.S.  scientists;  and  establishes  and  maintains
environmental  research  capability  in  the  former  Soviet
Union.  The  program  has  supported  more  than  23  Russian
principal  investigators  and  approximately  110  other  research
associates  in  Moscow,  St.  Petersburg,  and
Novosibirsk.  More  importantly,  the  program  has  enabled
many  high  quality  Russian  biological,  genome  informatics,
physical  mapping  and  mutagenesis  detection,  human  genetics,
biochemistry,  DNA  sequencing  technology,  protein
analysis,  molecular  genetics,  and  other  related  research
infrastructures  to  continue  operating  in  an  uncertain  economic
environment.

DOE  Contract  No.  DE-AC05-76OR00033. 
1996  Phase  I

An  Engineered  RNA/DNA  Polymerase 
to  Increase  Speed  and  Economy  of
DNA  Sequencing 

Mark  W.  Knuth 

Promega  Corporation;  Madison,  WI  53711-5399 
608/274-4330,  Fax:  /277-2601 

DNA  sequence  information  is  the  cornerstone  for  consider- 
able experimental  design  and  analysis  in  the  biological 
sciences.  The  proposed  studies  will  focus  on  advancing 
DNA  sequencing  by  creating  a  new  enzyme  that  eliminates 
the  need  for  an  oligonucleotide  primer  to  initiate  DNA 
synthesis  at  a  defined  site,  and  that  can  use  dideoxy  nucle- 
otides for  chain  termination.  The  new  method  should  re- 
duce the  time  and  cost  required  to  obtain  DNA  sequences 
and  enhance  the  speed  and  cost  effectiveness  of  current 
DNA  sequencing  technologies.  Phase  I  studies  will  focus 
on  purifying  mutant  T7  RNA  polymerases  known  to  incorporate
dNTPs  into  DNA  chains,  developing  protocols  for
rapid  small  scale  mutant  enzyme  purification,  evaluating
the  purified  mutants  for  properties  relevant  to  DNA  sequencing,
developing  facile  mutagenesis  schemes,  and  producing
mutant  RNA/DNA  polymerases  with  altered  promoter
recognition.  The  results  from  Phase  I  will  provide
the  foundation  for  Phase  II  research,  which  will  focus  on
refining  properties  of  the  mutant  by  (1)  expanding  the
number  of  mutations  examined  using  the  purification  protocols,
assays,  and  mutagenesis  screening  methods  developed
in  Phase  I,  (2)  examining  the  effect  of  each  mutation
on  enzymatic  properties  important  to  DNA  sequencing
applications,  and  (3)  optimizing  conditions  for  sequencing
performance.  In  Phase  III,  Promega  will  commercialize  the 
new  mutant  enzymes  through  its  own  extensive  distribu- 
tion network  and  by  collaborating  with  major  instrumenta- 
tion firms  to  adapt  the  technology  to  automated  DNA  se- 
quencing systems. 

DOE  Grant  No.  DE-FG02-96ER8226.


Directed  Multiple  DNA  Sequencing  and 
Expression  Analysis  by  Hybridization 

Gualberto  Ruano 

BIOS  Laboratories,  Inc.;  New  Haven,  CT  06511 
800/678-9487  or  203/773-1450,  Fax:  800/315-7435  or 
203/562-9377 

The  overall  goal  of  this  project  is  to  develop  molecular 
resources  with  direct  applications  to  either  DNA  sequence 
analysis  or  gene  expression  analysis  in  multiplexed  for- 
mats using  sequential  hybridization  of  Peptide  Nucleic 
Acid  (PNA)  oligomer  probes.  PNA  oligomers  hybridize 
more  stably  and  specifically  to  cognate  DNA  targets  than 
conventional  DNA  oligonucleotides.  The  Phase  I  project 
discussed  here  is  concerned  with  development  of  PNA 
probe  technology  having  direct  application  either  to  the 
directed  sequencing  process  or  to  gene  expression  profil- 
ing. With  regard  to  directed  sequencing,  we  seek  improve- 
ments in  the  three  multiply  repeated  steps  associated  with 
this  process,  namely  (1)  probe  assembly,  (2)  sequencing 
reactions,  and  (3)  gel  electrophoresis.  In  PNA  hybridiza- 
tion sequencing,  sequences  are  generated  directly  from  the 
template  by  multiplex  DNA  sequencing  using  anchor 
primers  known  to  have  frequent  annealing  sites.  Electro- 
phoresis is  performed  en  masse  for  each  anchor  primer 
reaction,  blotted  to  nylon  membranes  and  individual  se- 
quences are  selectively  exposed  by  iterative  hybridization 
to  specific  8-mer  PNA  probes  derived  from  sequences  sta- 
tistically over-represented  in  expressed  DNA  and  obtained 
from  a  pre-synthesized  library.  Additionally,  the  same  PNA 
library  can  be  used  as  a  source  of  hybridization  probes  for 
querying  expression  patterns  of  specific  genes  in  any  cell 
line  or  tissue.  Specific  gene  expression  can  be  monitored 
by  coupling  gene-specific  RT-PCR  with  hybridization 
when  cDNA  products  are  separated  by  gel  electrophoresis 
and  blotted  to  nylon  membranes.  Patterns  of  gene  expres- 
sion are  then  resolved  by  hybridization  using  PNA  oligo- 
mers. Bands  corresponding  to  specific  genes  can  be 
deconvoluted  using  sequence  information  from  RT-PCR 
primers  and  PNA  probes.  Higher  throughput  expression 
analysis  can  be  achieved  by  multiplexed  gel  electrophore- 
sis, blotting  and  iterative  probing  of  RT-PCR  reactions 
with  individual  PNA  probes. 
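
The abstract above says that the 8-mer probes are drawn from sequences statistically over-represented in expressed DNA, but it does not say how that over-representation is measured. The sketch below is one illustrative way to rank 8-mers, assuming a uniform background over the 4^8 = 65,536 possible 8-mers. The function names, the `fold` threshold, and the uniform-background model are assumptions made for illustration and are not taken from the project.

```python
from collections import Counter

BASES = set("ACGT")


def count_kmers(sequences, k=8):
    """Count every k-mer made only of A, C, G, T in the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= BASES:  # skip windows containing N or other symbols
                counts[kmer] += 1
    return counts


def overrepresented(counts, k=8, fold=2.0):
    """Return k-mers seen at least `fold` times more often than a uniform
    background would predict, most frequent first (ties broken alphabetically).

    Assumption: each of the 4**k possible k-mers is equally likely.
    """
    total = sum(counts.values())
    expected = total / 4 ** k
    hits = [kmer for kmer, n in counts.items() if n >= fold * expected]
    return sorted(hits, key=lambda kmer: (-counts[kmer], kmer))
```

A real probe library would likely use a background model fitted to genomic base composition rather than a uniform one; the uniform model is used here only to keep the sketch short.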

DOE  Grant  No.  DE-FG02-96ER8213.
1996  Phase  II

A  Graphical  Ad  Hoc  Query  Interface 
Capable  of  Accessing  Heterogeneous 
Public  Genome  Databases 

Joseph  Leone 

CyberConnect  Corporation;  Storrs,  CT  06268
860/486-2783,  Fax:  /429-2372

The  interoperability  of  public  genome  databases  is  ex- 
pected to  be  crucial  in  making  the  Human  Genome  Project 
a  success.  This  project  will  develop  software  tools  in 
which  users  in  the  genome  community  can  learn  or  exam- 
ine public  genome  database  schemes  in  a  relatively  short 
time  and  can  produce  a  correct  Structured  Query  Language 
(SQL)  expression  easily.  In  Phase  I,  a  concept  system  was 
constructed  and  the  effectiveness  of  formulating  ad  hoc 
queries  graphically  was  demonstrated.  Phase  II  will  focus
on  transforming  the  concept  system  into  a  product  that  is
robust  and  portable.  Two  types  of  computer  programs  will
be  developed.  One  is  a  client  program  which  is  to  be  dis- 
tributed to  community  users  who  intend  to  access  public 
genomic  databases  and  link  them  with  local  databases.  The 
other  is  a  server  program  and  a  suite  of  software  tools  de- 
signed to  be  used  by  those  genome  centers  which  intend  to 
make  their  databases  publicly  accessible. 
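
The abstract above describes turning a graphically composed ad hoc query into a correct SQL expression. As a rough illustration of that idea, the sketch below takes the kind of structured selection a graphical front end might produce (a table, columns, and filters) and emits a parameterized SELECT statement. The `locus` table, its columns, the gene symbols, and the function names are hypothetical, and SQLite stands in for whatever database engines the genome centers actually used.

```python
import sqlite3

ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">=", "LIKE"}


def build_select(table, columns, filters):
    """Build a parameterized SELECT from a structured selection.

    filters is a list of (column, operator, value) tuples, all ANDed together.
    """
    for name in [table, *columns, *(f[0] for f in filters)]:
        if not name.isidentifier():  # reject anything that is not a plain name
            raise ValueError(f"unsafe identifier: {name!r}")
    clauses, params = [], []
    for column, operator, value in filters:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def demo():
    """Run a generated query against a small in-memory example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locus (symbol TEXT, chromosome TEXT)")
    conn.executemany(
        "INSERT INTO locus VALUES (?, ?)",
        [("GENE_A", "19"), ("GENE_B", "7")],  # made-up rows for illustration
    )
    sql, params = build_select(
        "locus", ["symbol", "chromosome"], [("chromosome", "=", "19")]
    )
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

Keeping values as bound parameters and validating identifiers is one way to make the generated SQL "correct" in the sense the abstract describes, since the user never types query text directly.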

DOE  Grant  No.  DE-FG02-95ER81906. 


Low-Cost  Automated  Preparation  of
Plasmid,  Cosmid,  and  Yeast  DNA

Tuyen  Nguyen,  Randy  F.  Sivila,  Joshua  P.  Dyer,  and
William  P.  MacConnell

MacConnell  Research  Corporation;  San  Diego,  CA  92121
619/452-2603,  Fax:  -6753

MacConnell  Research  currently  manufactures  and  sells  a
low  cost  automated  bench-top  instrument  that  can  purify
up  to  24  samples  of  plasmid  DNA  simultaneously  in  one
hour  at  a  cost  of  $0.65  per  sample  and  under  $8000  for  the
instrument.  The  patented  instrument  uses  a  form  of  agarose
gel  electrophoresis  to  purify  the  plasmid  DNA  and
electroelutes  into  approximately  a  20  µl  volume.  The  instrument
has  many  advantages  over  other  robotic  and
manual  methods,  including  the  fact  that  it  is  two  times
faster,  at  least  six  times  less  expensive,  much  smaller  in
size,  easier  to  operate,  lower  in  cost  per  sample,  and  results  in
DNA  pure  enough  for  direct  use  in  fluorescent  automated
sequencing.  The  instrument  process  begins  with  bacterial
culture  which  is  loaded  directly  into  a  disposable  cassette
in  the  machine.

In  Phase  II  work  we  are  developing  an  instrument  which
simultaneously  purifies  plasmid  DNA  from  up  to  192  (2
X  96)  bacterial  samples  in  1.5  hours.  Prototypes  of  this
instrument  thus  far  constructed  have  allowed  the  purification
of  3-7  micrograms  of  high  purity  plasmid  DNA
per  lane  from  1.5  ml  of  bacterial  culture.  We  have  attempted
to  optimize  all  of  the  following:  instrument  electrophoretic
run  parameters,  lysis  chemistry,  lysis  reagent  delivery
devices,  reagent  storage  at  room  temperature,  desalting
processes,  and  overall  instrument  mechanical  and  electronic
control.  Instrument  prototypes  have  also  been
used  to  prepare  cosmid  or  yeast  DNA  in  quantities  of  1-5
micrograms  per  cassette  lane.  Trials  thus  far  have
yielded  plasmid  DNA  of  sufficient  purity  for  direct  use
in  automated  fluorescent  and  manual  sequencing  as  well
as  other  molecular  biology  protocols.  We  have  studied
the  purity  of  the  resulting  DNA  when  directly  sequenced
on  a  Licor  4000  Long  Reader  and  ABI  373A  automated
DNA  sequencers.  Results  from  the  Licor  4000  instrument
give  routine  read  lengths  of  >850  base  pairs  with
98%  accuracy  while  ABI  373A  reads  generally  exceed
400  base  pairs  with  similar  accuracy.

The  proposed  2  X  96-channel  instrument  will  purify  up
to  1200  plasmid  DNA  preps  per  eight  hour  day.  It  will
significantly  reduce  the  cost  and  technician  labor  of  high
throughput  plasmid  DNA  purification  for  automated  sequencing
and  mapping.

DOE  Grant  No.  DE-FG03-94ER81802/A000.


GRAIL-GenQuest:  A  Comprehensive 
Computational  Framework  for  DNA 
Sequence  Analysis 

Ruth  Ann  Manning 

ApoCom,  Inc.;  Oak  Ridge,  TN  37830 
423/482-2500,  Fax:  /220-2030 

Although  DNA  sequencing  in  the  Human  Genome 
Project  is  occurring  fairly  systematically,  biotechnology 
companies  have  focused  on  sequencing  regions  thought 
to  contain  particular  disease  genes.  The  client-server 
DNA  sequence  analysis  system  GRAIL  is  the  most  accu- 
rate and  widely  used  computer-based  system  for  locating 
and  characterizing  genes  in  DNA  sequences,  but  it  is  not 
accessible  to  many  biotechnology  environments.  The 
GRAIL  client  software  and  graphical  displays  have  been 
developed  for  high-end  UNIX-based  computer  worksta- 
tions. Such  workstations  are  standard  equipment  in  uni- 
versities and  large  companies,  but  personal  computers 
(PCs) and Macintosh computers are the prevalent technology
within the biotechnology community. This
Phase I project will design Macintosh- and Windows-based
client graphical user interface prototypes for
GRAIL. 
The growth of DNA databases is expected to continue at a
fast pace in the attempt to sequence the human genome
completely by the year 2005. Parallel processing is a viable
solution to handle searching through the ever-increasing
volume of data. During Phase I, genQuest, the sequence
comparison server portion of the GRAIL system,
will be parallelized for shared-memory platforms and will
use PVM¹ for the development of genQuest servers on networks
of PCs and workstations and other innovative,
high-performance computer architectures.

Prototype graphical interface systems for Macintosh, Windows NT,
and Windows 95 that mimic the function and
operation of the current GRAIL-genQuest clients will enable
a larger portion of biotechnology companies to make
use of the GRAIL suite of analysis tools. Parallel genQuest
servers will improve response time for searches and increase
user capacity per server. Such fast shared- and
distributed-memory computing solutions will improve the
cost-performance ratio and make parallel searches more
affordable to the biotechnology community using general
multipurpose hardware.

DOE Grant No. DE-FG02-95ER81923.

¹The Parallel Virtual Machine (PVM) message-passing
library allows a collection of UNIX-based computers to
function as a single multiple-processor supercomputer.
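
The idea behind a parallel sequence comparison server can be sketched as follows: the database is split into shards, each worker scores the query against one shard, and the partial results are merged into one ranked list. This is an illustration under stated assumptions, not genQuest or PVM. It uses Python's standard thread pool in place of message passing and a simple count of shared k-mers in place of a real alignment score; the function names and the toy sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_shard(query, shard, k=4):
    """Score every (name, sequence) pair in one shard against the query."""
    query_kmers = kmers(query, k)
    return [(name, len(query_kmers & kmers(seq, k))) for name, seq in shard]


def parallel_search(query, database, workers=2, k=4):
    """Split the database into shards, score them concurrently, merge hits."""
    items = sorted(database.items())
    shards = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda shard: score_shard(query, shard, k), shards))
    hits = [hit for part in parts for hit in part]
    # Best score first; ties broken by name so the order is reproducible.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    # Toy database; the names and sequences are made up for the example.
    toy_db = {"a": "ACGTACGTAA", "b": "TTTTTTTTTT", "c": "ACGTTTTT"}
    for name, score in parallel_search("ACGTACGT", toy_db):
        print(name, score)
```

A thread pool keeps the sketch portable and easy to run. For CPU-bound scoring on real data, the shards would be handed to separate processes or machines, which is the role a message-passing library such as PVM plays in the project described above.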
Projects Completed FY 1994-95


Projects  in  this  section  have  been  completed  or  did  not  receive  support  through  the  DOE  Human  Genome  Program  in 
FY  1996. 


Sequencing 


Sequencing  by  Hybridization:  Methods  to  Generate 
Large  Arrays  of  Oligonucleotides 
Thomas  M.  Brennan 

Sequencing  by  Hybridization:  Development  of  an 
Efficient  Large-Scale  Methodology 
Radomir  Crkvenjakov 

Genomic  Instrumentation  Development:  Detection 
Systems  for  Film  and  High-Speed  Gel-Less  Methods 
Jack  B.  Davidson  and  Robert  S.  Foote 

Single-Molecule Detection Using Charge-Coupled
Device Array Technology
M. Bonner Denton, Richard Keller, Mark E.
Baker, Colin W. Earle, and David A. Radspinner

Coupling  Sequencing  by  Hybridization  with  Gel 
Sequencing  for  Inexpensive  Analysis  of  Genes  and 
Genomes 

Radoje Drmanac, Snezana Drmanac, and
Ivan Labat

Physical  Structure  and  DNA  Sequence  of  Human 
Chromosomes 

Glen  A.  Evans 

Using Scanning Tunneling Microscopy to Sequence
the Human Genome
Thomas L. Ferrell, Robert J. Warmack,
David P. Allison, K. Bruce Jacobson,
Gilbert M. Brown, and Thomas G. Thundat

DNA  Sequence  Analysis  by  Solid-Phase  Hybridization 
Robert  S.  Foote,  Richard  A.  Sachleben,  and 
K.  Bruce  Jacobson 

DNA  Sequencing  Using  Stable  Isotopes 

K.  Bruce  Jacobson,  Heinrich  F.  Arlinghaus, 
Gilbert  M.  Brown,  Robert  S.  Foote, 
Frank  W.  Larimer,  Richard  A.  Sachleben, 
Norbert  Thonnard,  and  Richard  P.  Woychik 

Preparation  of  Oligonucleotide  Arrays  for  Hybridiza- 
tion Studies 

Michael C. Pirrung, Steven W. Shuey,
David C. Lever, Lara Fallon, J.-C. Bradley, and
William P. Hawe


Improvement  and  Automation  of  Ligation-Mediated 
Genomic  Sequencing 

Arthur D. Riggs and Gerd P. Pfeifer

*Analysis of a 53-Kb Nucleotide Sequence from the
Right Genome Terminus of the Variola Major Virus
Strain India-1967
Sergei N. Shchelkunov, Vladimir M. Blinov,
Sergei M. Resenchuk, Alexei V. Totmenin,
Viktor N. Krasnykh, Ludmilla V. Olenina,
Oleg I. Serpinsky, and Lev S. Sandakhchiev

A  High-Speed  Automated  DNA  Sequencer 
Lloyd  M.  Smith 

Characterization  and  Modification  of  DNA 
Polymerases  for  Use  in  DNA  Sequencing 
Stanley  Tabor 


Mapping 

*Toward Cloning Human Chromosome 19 in Yeast
Artificial Chromosomes
Inga P. Arman, Alexander B. Devin, Svetlana P.
Legchilina, Irina G. Efimenko, Marina E.
Smirnova, and Dina V. Glazkova

A  Panel  of  Mouse-Human  Monochromosomal 
Hybrid  Cell  Lines,  Each  Containing  a  Single  Differ- 
ent Tagged  Human  Chromosome 

Arbansjit K. Sandhu, G. Pal Kaur, and
Raghbir S. Athwal

*Preparation of a Set of Molecular Markers for
Human Chromosome 5 Using G+C-Rich and
Functional Site-Specific Oligonucleotides
M.L. Filipenko, A.I. Muravlev, E.I. Jantsen,
V.V. Smirnova, N.A. Chikaev, V.P. Mishin, and
M.A. Ivanovich

An  Improved  Method  for  Producing  Radiation 
Hybrids  Applied  to  Human  Chromosome  19 

Cynthia  L.  Jackson  and  Hon  Fong  L.  Mark 
Completed Projects

Construction of a Human Genome Library Composed
of Multimegabase Acentric Chromosome
Fragments
Michael J. Lane, Peter Hahn, and John Hozier

Reagents for Understanding and Sequencing the
Human Genome
J.R. Korenberg, X-N. Chen, S. Mitchell,
S. Gerwehr, Z. Sun, D. Noya, R. Hubert,
U-J. Kim, H. Shizuya, X. Wu, J. Silva, B. Birren,
T.J. Hudson, P. de Jong, E. Lander, and M. Simon

Development of Diallelic Marker Maps Using
PCR/OLA
Deborah A. Nickerson and Pui-Yan Kwok

Multiplex Mapping of Human cDNAs
William C. Nierman, Donna R. Maglott, and
Scott Durkin

Physical Mapping in Preparation for DNA Sequencing
Andreas Gnirke, Regina Lim, Gane Wong,
Jun Yu, Roger Bumgarner, and Maynard Olson

Construction of a Genetic Map Across Chromosome 21
Elaine A. Ostrander

Integrated Physical Mapping of Human cDNAs
Mihael H. Polymeropoulos

Sequence-Tagged Sites for Human Chromosome 19
cDNAs
Michael J. Siciliano and Anthony V. Carrano

cDNA/STS Map of the Human Genome: Methods
Development and Applications Using Brain cDNAs
James M. Sikela, Akbar S. Khan, Arto K.
Orpana, Andrea S. Wilcox, Janet A. Hopkins, and
Tamara J. Stevens

Physical Structure of Human Chromosome 21
Cassandra L. Smith, Denan Wang,
Kaoru Yoshida, Jesus Sainz, Carita Fockler, and
Meite Bremer

Physical Mapping of Human Chromosome 16
David F. Callen, Sinoula Apostolou, Elizabeth
Baker, Helen Kozman, Sharon A. Lane,
Julie Nancarrow, Hilary A. Phillips, Scott A.
Whitmore, Norman A. Doggett, John C. Mulley,
Robert I. Richards, and Grant R. Sutherland

Chromosome Mapping by FISH to Interphase Nuclei
Barbara J. Trask

Flow Karyotyping and Flow Instrumentation Development
Ger van den Engh and Barbara Trask

Isolation of Specific Human Telomeric Clones by
Homologous Recombination and YAC Rescue
Geoffrey Wahl and Linnea Brody


Informatics


*A  Method  for  Direct  Sequencing  of  Diploid 
Genomes on Oligonucleotide Arrays: Theoretical
Analysis  and  Computer  Modeling 
Alexander  B.  Chetverin 

Sampling-Based  Methods  for  the  Estimation  of  DNA 
Sequence  Accuracy 

Gary  Churchill  and  Betty  Lazareva 

Computer-Aided  Genome  Map  Assembly  with 
SIGMA  (System  for  Integrated  Genome  Map 
Assembly) 

Michael  J.  Cinkosky,  Michael  A.  Bridgers, 
William  M.  Barber,  Mohamad  Ijadi,  and 
James  W.  Fickett 

Informatics  for  the  Sequencing  by  Hybridization 
Project 

Aleksandar Milosavljevic and Radomir
Crkvenjakov

Sequencing by Hybridization Algorithms and
Computational Tools
Radoje Drmanac, Ivan Labat, and
Nick Stavropoulos

HGIR:  Information  Management  for  a  Growing  Map 
James W. Fickett, Michael J. Cinkosky,
Michael A. Bridgers, Henry T. Brown, Christian
Burks, Philip E. Hempfner, Tran N. Lai, Debra
Nelson, Robert M. Pecherer, Doug Sorenson,
Peichen H. Sgro, Robert D. Sutherland,
Charles D. Troup, and Bonnie C. Yantis


Identification of Genes in Anonymous DNA
Sequences
Christopher A. Fields and Carol A. Soderlund

Algorithms in Support of the Human Genome Project
Dan Gusfield, Jim Knight, Kevin Murphy,
Paul Stelling, Lushen Wang, Archie Cobbs,
Paul Horton, Richard Karp, and Gene Lawler

BISP: VLSI Solutions to Sequence-Comparison
Problems
Tim Hunkapiller, Leroy Hood, Ed Chen, and
Michael Waterman

Physical Mapping of DNA Molecules
Richard M. Karp

BIOSCI Electronic Newsgroup Network for the
Biological Sciences
David Kristofferson

Multiple Alignment and Homolog Sequence Database
Compilation
Hwa A. Lim

Applying Machine Learning Techniques to DNA
Sequence Analysis
Jude W. Shavlik, Michiel O. Noordewier,
Geoffrey Towell, Mark Craven, Andrew Whitsitt,
Kevin Cherkauer, and Lorien Pratt

New Approaches to Recognizing Functional
Domains in Biological Sequences
Gary D. Stormo


ELSI


Protecting Genetic Privacy by Regulating the
Collection, Analysis, Use, and Storage of DNA and
Information Obtained from DNA Analysis
George J. Annas, Leonard H. Glantz, and
Patricia A. Roche

"The Secret of Life"
Paula Apsell and Graham Chedd

Genome Technology and Its Implications: A
Hands-On Workshop for Educators
Diane Baker and Paula Gregory

Predicting Future Disease: Issues in the Development,
Application, and Use of Tests for Genetic
Disorders
Ruth E. Bulger and Jane E. Fullarton

HUGO International Yearbook: Genetics, Ethics,
Law, and Society (GELS)
Alex Capron and Bartha Knoppers

The Human Genome: Science and the Social Consequences;
Interactive Exhibits and Programs on Genetics
and the Human Genome
Charles C. Carlson

International Conference Working Group: The Social
Costs and Medical Benefits of Human Genetic
Information
Betsy Fader

"Medicine at the Crossroads"
George Page and Stefan Moore

Pilot Senior Research Fellowship Program: Bioethical
Issues in Molecular Genetics
Declan Murphy and Claudette Cyr Friedman

Studies of Genetic Discrimination
Marvin Natowicz

DNA Banking and DNA Data Banking: Legal,
Ethical, and Public Policy Issues
Philip Reilly

Mechanical Interactive Exhibits on Biotechnology
Elizabeth Sharpe

Impact of Technology Derived from the Human
Genome Project on Genetic Testing, Screening, and
Counseling: Cultural, Ethical, and Legal Issues
Ralph W. Trottier, Lee A. Crandall,
David Phoenix, Mwalimu Imara, and
Ray E. Mosley

Social Science Concepts and Studies of Privacy:
A Comprehensive Inventory and Analysis for
Considering Privacy, Confidentiality, and Access
Issues in the Use of Genetic Tests and Applications of
Genetic Data
Alan F. Westin

Human Genetics and Genome Analysis: A Practical
Workshop for Public Policymakers and Opinion
Leaders
Jan Witkowski, David A. Micklos, and
Margaret Henderson


SBIR  Phase  I 

A Graphical Ad Hoc Query Interface Capable of Accessing
Heterogeneous Public Genome Databases
J.  Clarke  Anderson 

Techniques  for  Screening  Large-Insert  Libraries 
Saika  Aytay 

Interactive  DNA  Sequence  Processing  for  a  Micro- 
computer 

Wayne  Dettloff  and  Holt  Anderson 

High-Performance  Searching  and  Pattern  Recogni- 
tion for  Human  Genome  Databases 
Douglas  J.  Eadline 

Estimating,  Encoding,  and  Using  Uncertainties  in  Se- 
quence Data 

John  R.  Hartman 

Low-Cost  Massively  Parallel  Neurocomputing  for 
Pattern  Recognition  in  Macromolecular  Sequences 
John  R.  Hartman 

Electrophoretic  Separation  of  DNA  Fragments  in  Ul- 
trathin  Planar-Format  Linear  Polyacrylamide 

Michael  T.  MacDonell  and  Darlene  B.  Roszak 

An  Acoustic  Plate  Mode  DNA  Biosensor 
Douglas  J.  McAllister 

Piezoelectric  Biosensor  Using  Peptide  Nucleic  Acids 
for  Triplex  Capture 

Douglas  McAllister 

Pedigree  Software  for  the  Presentation  of  Human  Ge- 
nome Information  for  Genetic  Education  and  Coun- 
seling 

Charles  L.  Manske 


A  High-Spatial-Resolution  Spectrograph  for  DNA 
Sequencing 

Cathy  D.  Newman 

Nonradioactive  Detection  Systems  Based  on 
Enzyme-Fragment  Complementation 
Peter  Richterich 

Separation  Media  for  DNA  Sequencing 

David  S.  Soane  and  Herbert  H.  Hooper 


SBIR  Phase  II 


Increased  Speed  in  DNA  Sequencing  by  Utilizing 
LARIS  and  SIRIS  to  Localize  Multiple  Stable 
Isotope-Labeled  Fragments 

Heinrich  F.  Arlinghaus 

Rapid,  High-Throughput  DNA  Sequencing  Using 
Confocal  Fluorescence  Imaging  of  Capillary  Arrays 
David  L.  Barker  and  Jay  Flatley 

Spatially  Defined  Oligonucleotide  Arrays 
Stephen  P.  A.  Fodor 

Site-Specific Endonucleases for Human Genome
Mapping
George Golumbeski, Kimberly Knoche,
Susanne Selman, Jim Hartnett, Lydia Hung, and
Peter Bayne

High-Performance  DNA  and  Protein  Sequence 
Analysis  on  a  Low-Cost  Parallel-Processor  Array 
John  R.  Hartman  and  David  L.  Solomon 

Chemiluminescent  Multiprimed  DNA  Sequencing 
Chris  S.  Martin,  Corinne  E.  M.  Olesen,  and 
Irena Bronstein
Appendix
Narratives  from  Large,  Multidisciplinary  Research  Projects 


Part  1  of  this  report  contains  narratives  that  represent  DOE  Human  Genome  Program  research  in  large, 
multidisciplinary projects. As a convenience to the reader, these narratives are reprinted without graphics in this
appendix.  Only  the  contact  persons  for  these  organizations  are  listed  in  the  Index  to  Principal  and  Coinvesti- 
gators.  To  obtain  more  information  on  research  carried  out  in  these  projects,  see  their  contact  information  or 
visit  the  Web  sites  listed  with  the  narratives. 


Joint Genome Institute

Elbert Branscomb

Lawrence Livermore National Laboratory Human Genome Center

Anthony V. Carrano

Los Alamos National Laboratory Center for Human Genome Studies

Larry L. Deaven

Lawrence Berkeley National Laboratory Human Genome Center

Mohandas Narla

University of Washington Genome Center

Maynard Olson

Genome Database

Stanley Letovsky and Robert Cottingham

National Center for Genome Resources

Peter Schad



Joint  Genome  Institute 

Genome  Center  Sequencing  Efforts  Merge 


Lawrence Livermore National Laboratory
7000 East Avenue, L-452
Livermore, CA 94551


Elbert  Branscomb,  JGI  Scientific  Director 

510/422-5681 

elbert@alu.llnl.gov  or  elbert@shotgun.llnl.gov 

http://www.jgi.doe.gov


In  a  major  restructuring  of  its  Human  Genome  Program, 
on October 23, 1996, the DOE Office of Biological and
Environmental Research established the Joint Genome
Institute (JGI) to integrate work based at its three major
human  genome  centers. 

The  JGI  merger  represents  a  shift  toward  large-scale  se- 
quencing via  intensified  collaborations  for  more  effective 
use  of  the  unique  expertise  and  resources  at  Lawrence 
Berkeley  National  Laboratory  (LBNL),  Lawrence 
Livermore  National  Laboratory  (LLNL),  and  Los  Alamos 
National  Laboratory.  Elbert  Branscomb  (LLNL)  serves  as 
JGI's Scientific Director. Capital equipment has been ordered,
and operational support of about $30 million is
projected  for  the  1998  fiscal  year. 

With  easy  access  to  both  LBNL  and  LLNL,  a  building  in 
Walnut  Creek,  California,  is  being  modified.  Here,  start- 
ing in  late  FY  1998,  production  DNA  sequencing  will  be 
carried  out  for  JGI.  Until  that  time,  large-scale  sequencing 
will  continue  at  LANL,  LBNL,  and  LLNL.  Expectations 
are  that  within  3  to  4  years  the  Production  Sequencing 
Facility  will  house  some  200  researchers  and  technicians 
working  on  high-throughput  DNA  sequencing  using 
state-of-the-art  robotics. 

Initial  plans  are  to  target  gene-rich  regions  of  around  1  to 
10  megabases  for  sequencing.  Considerations  include 
gene  density,  gene  families  (especially  clustered  families), 
correlations  to  model  organism  results,  technical  capabili- 
ties, and  relevance  to  the  DOE  mission  (e.g.,  DNA  repair, 
cancer  susceptibility,  and  impact  of  genotoxins).  The  JGI 
program  is  subject  to  regular  peer  review. 

Sequence  data  will  be  posted  daily  on  the  Web;  as  the  in- 
formation progresses  to  finished  quality,  it  will  be  submit- 
ted to  public  databases. 

As  JGI  and  other  investigators  involved  in  the  Human  Ge- 
nome Project  are  beginning  to  reveal  the  DNA  sequence 
of  the  3  billion  base  pairs  in  a  reference  human  genome, 
the  data  already  are  becoming  valuable  reagents  for 


explorations  of  DNA  sequence  function  in  the  body,  some- 
times called  "functional  genomics."  Although  large-scale 
sequencing  is  JGI's  major  focus,  another  important  goal 
will  be  to  enrich  the  sequence  data  with  information  about 
its  biological  function.  One  measure  of  JGI's  progress  will 
be  its  success  at  working  with  other  DOE  laboratories, 
genome  centers,  and  non-DOE  academic  and  industrial 
collaborators.  In  this  way,  JGI's  evolving  capabilities  can 
both  serve  and  benefit  from  the  widest  array  of  partners. 


Production  DNA  Sequencing  Begun 
Worldwide 

The year 1996 marked a transition to the final and most
challenging  phase  of  the  U.S.  Human  Genome  Project,  as 
pilot  programs  aimed  at  refining  large-scale  sequencing 
strategies  and  resources  were  funded  by  DOE  and  NIH 
(see  Research  Highlights,  DNA  Sequencing,  p.  14).  Inter- 
nationally, large-scale  human  genome  sequencing  was 
kicked  off  in  late  1995  when  The  Wellcome  Trust  an- 
nounced a  7-year,  $75-million  grant  to  the  private  Sanger 
Centre  to  scale  up  its  sequencing  capabilities.  French  in- 
vestigators also  have  announced  intentions  to  begin  pro- 
duction sequencing. 

Funding  agencies  worldwide  agree  that  rapid  and  free  re- 
lease of  data  is  critical.  Other  issues  include  sequence  ac- 
curacy, types  of  annotation  that  will  be  most  useful  to  bi- 
ologists, and  how  to  sustain  the  reference  sequence. 

The  international  Human  Genome  Organisation  maintains 
a  Web  page  to  provide  information  on  current  and  future 
sequencing  projects  and  links  to  sites  of  participating 
groups  (http://hugo.gdb.org).  The  site  also  links  to  reports 
and  resources  developed  at  the  February  1996  and  1997 
Bermuda  meetings  on  large-scale  human  genome  sequenc- 
ing, which  were  sponsored  by  The  Wellcome  Trust. 
Research Narratives
Lawrence  Livermore  National  Laboratory  Human  Genome  Center 


Human  Genome  Center 

Lawrence  Livermore  National  Laboratory 

Biology and Biotechnology Research Program

7000  East  Avenue,  L-452 

Livermore,  CA  94551 


Anthony  V.  Carrano,  Director 

510/422-5698,  Fax:  /423-3110,  carrano1@llnl.gov 

Linda  Ashworth,  Assistant  to  Center  Director 
510/422-5665, Fax: -2282, ashworth1@llnl.gov

http://www-bio.llnl.gov/bbrp/genome/genome.html


The  Human  Genome  Center  at  Lawrence  Livermore  Na- 
tional Laboratory  (LLNL)  was  established  by  DOE  in 
1991. The center operates as a multidisciplinary team
whose  broad  goal  is  understanding  human  genetic  mate- 
rial. It  brings  together  chemists,  biologists,  molecular  bi- 
ologists, physicists,  mathematicians,  computer  scientists, 
and  engineers  in  an  interactive  research  environment  fo- 
cused on  mapping,  DNA  sequencing,  and  characterizing 
the  human  genome. 

Goals  and  Priorities 

In  the  past  2  years,  the  center's  goals  have  undergone  an 
exciting  evolution.  This  change  is  the  result  of  several  fac- 
tors, both  intrinsic  and  extrinsic  to  the  Human  Genome 
Project.  They  include:  (1)  successful  completion  of  the 
center's  first-phase  goal,  namely  a  high-resolution, 
sequence-ready  map  of  human  chromosome  19;  (2)  ad- 
vances in  DNA  sequencing  that  allow  accelerated  scaleup 
of  this  operation;  and  (3)  development  of  a  strategic  plan 
for  LLNL's  Biology  and  Biotechnology  Research  Program 
that  will  integrate  the  center's  resources  and  strengths  in 
genomics  with  programs  in  structural  biology,  individual 
susceptibility,  medical  biotechnology,  and  microbial  bio- 
technology. 

The  primary  goal  of  LLNL's  Human  Genome  Center  is  to 
characterize  the  mammalian  genome  at  optimal  resolution 
and  to  provide  information  and  material  resources  to  other 
in-house  or  collaborative  projects  that  allow  exploitation 
of genomic biology in a synergistic manner. DNA sequence
information provides the biological driver for the
center's  priorities: 

•  Generation  of  highly  accurate  sequence  for  chromo- 
some 19. 

•  Generation  of  highly  accurate  sequence  for  genomic 
regions  of  high  biological  interest  to  the  mission  of 
the  DOE  Office  of  Biological  and  Environmental  Re- 
search (e.g.,  genes  involved  in  DNA  repair,  replica- 
tion, recombination,  xenobiotic  metabolism,  and  cell- 
cycle  control). 

•  Isolation  and  sequence  of  the  full  insert  of  cDNA 
clones  associated  with  genomic  regions  being  se- 
quenced. 


•  Sequence of selected corresponding regions of the
mouse genome in parallel with the human.

•  Annotation and position of the sequenced clones with
physical landmarks such as linkage markers and sequence
tagged sites (STSs).

•  Generation of mapped chromosome 19 and other genomic
clones [cosmids, bacterial artificial chromosomes
(BACs), and P1 artificial chromosomes (PACs)]
for collaborating groups.

•  Sharing of technology with other groups to minimize
duplication of effort.

•  Support of downstream biology projects (for example,
structural biology, functional studies, human variation,
transgenics, medical biotechnology, and microbial
biotechnology) with know-how, technology, and material
resources.


Center Organization and Activities

Completion  and  publication  of  the  metric  physical  map  of 
human  chromosome  19  in  1995  has  led  to  consolidation  of 
many  functions  associated  with  physical  mapping,  with  in- 
creased emphasis  on  DNA  sequencing.  The  center  is  orga- 
nized into  five  broad  areas  of  research  and  support:  se- 
quencing, resources,  functional  genomics,  informatics  and 
analytical  genomics,  and  instrumentation.  Each  area  con- 
sists of  multiple  projects,  and  extensive  interaction  occurs 
both  within  and  among  projects. 

Sequencing 

The  sequencing  group  is  divided  into  several  subprojects. 
The  core  team  is  responsible  for  the  construction  of  se- 
quence libraries,  sequencing  reactions,  and  data  collection 
for  all  templates  in  the  random  phase  of  sequencing.  The 
finishing  team  works  with  data  produced  by  the  core  team 
to produce highly redundant, highly accurate "finish" sequence
on targets of interest. Finally, a team of researchers
focuses  specifically  on  development,  testing,  and  imple- 
mentation of  new  protocols  for  the  entire  group,  with  an 
emphasis  on  improving  the  efficiency  and  cost  basis  of  the 
sequencing  operation. 
Resources

The  resources  group  provides  mapped  clonal  resources  to 
the  sequencing  teams.  This  group  performs  physical  map- 
ping as  needed  for  the  DNA  sequencing  group  by  using 
fingerprinting,  restriction  mapping,  fluorescence  in  situ 
hybridization,  and  other  techniques.  A  small  mapping  ef- 
fort is  under  way  to  identify,  isolate,  and  characterize  BAC 
clones  (from  anywhere  in  the  human  genome)  that  relate  to 
susceptibility  genes,  for  example,  DNA  repair.  These 
clones  will  be  characterized  and  provided  for  sequencing 
and  at  the  same  time  contribute  to  understanding  the  biol- 
ogy of  the  chromosome,  the  genome,  and  susceptibility 
factors.  The  mapping  team  also  collaborates  with  others 
using  the  chromosome  19  map  as  a  resource  for  gene  hunt- 
ing. 

Functional  Genomics 

The  functional  genomics  team  is  responsible  for  assem- 
bling and  characterizing  clones  for  the  Integrated  Molecu- 
lar Analysis  of  Gene  Expression  (called  IMAGE)  Consor- 
tium and  cDNA  sequencing,  as  well  as  for  work  on  gene 
expression  and  comparative  mouse  genomics.  The  effort 
emphasizes  genes  involved  in  DNA  repair  and  links 
strongly  to  LLNL's  gene-expression  and  structural  biology 
efforts.  In  addition,  this  team  is  working  closely  with  Oak 
Ridge  National  Laboratory  (ORNL)  to  develop  a  compara- 
tive map  and  the  sequence  data  for  mouse  regions  syntenic 
to  human  chromosome  19. 

Informatics  and  Analytical  Genomics 

The  informatics  and  analytical  genomics  group  provides 
computer  science  support  to  biologists.  The  sequencing 
informatics  team  works  directly  with  the  DNA  sequencing 
group to facilitate and automate sample handling, data acquisition
and storage, and DNA sequence analysis and annotation.
The analytical genomics team provides statistical
and  advanced  algorithmic  expertise.  Tasks  include  devel- 
opment of  model-based  methods  for  data  capture,  signal 
processing,  and  feature  extraction  for  DNA  sequence  and 
fingerprinting  data  and  analysis  of  the  effectiveness  of 
newly  proposed  methods  for  sequencing  and  mapping. 

Instrumentation 

The  instrumentation  group  also  has  multiple  components. 
Group  members  provide  expertise  in  instrumentation  and 
automation  in  high-throughput  electrophoresis,  preparation 
of  high-density  replicate  DNA  and  colony  filters,  fluores- 
cence labeling  technologies,  and  automated  sample  han- 
dling for  DNA  sequencing.  To  facilitate  seamless  integra- 
tion of  new  technologies  into  production  use,  this  group  is 
coupled  tightly  to  the  biologist  user  groups  and  the 
informatics  group. 


Collaborations 

The  center  interacts  extensively  with  other  efforts  within 
the  LLNL  Biology  and  Biotechnology  Research  Program 
and  with  other  programs  at  LLNL,  the  academic  commu- 
nity, other  research  institutes,  and  industry.  More  than  250 
collaborations range from simple probe and clone sharing
to detailed gene family studies. The following list reflects
some  major  collaborations. 

•  Integration  of  the  genetic  map  of  human  chromo- 
some 19  with  corresponding  mouse  chromosomes 
(ORNL). 

•  Miniaturized  polymerase  chain  reaction  instrumenta- 
tion (LLNL). 

•  Sequencing  of  IMAGE  Consortium  cDNA  clones 
(Washington  University,  St.  Louis). 

•  Mapping  and  sequencing  of  a  gene  associated  with 
Finnish  congenital  nephrotic  syndrome  (University  of 
Oulu,  Finland). 


Accomplishments 

The  LLNL  Human  Genome  Center  has  excelled  in  several 
areas,  including  comparative  genomic  sequencing  of  DNA 
repair  genes  in  human  and  rodent  species,  construction  of 
a  metric  physical  map  of  human  chromosome  19,  and  de- 
velopment and  application  of  new  biochemical  and  math- 
ematical approaches  for  constructing  ordered  clone  maps. 
These  and  other  major  accomplishments  are  highlighted 
below. 

•  Completion  of  highly  accurate  sequencing  totaling 
1.6  million  bases  of  DNA,  including  regions  spanning 
human  DNA  repair  genes,  the  candidate  region  for  a 
congenital  kidney  disease  gene,  and  other  regions  of 
biological  interest  on  chromosome  19. 

•  Completion  of  comparative  sequence  analysis  of 
107,500  bases  of  genomic  DNA  encompassing  the 
human  DNA  repair  gene  ERCC2  and  the  correspond- 
ing regions  in  mouse  and  hamster.  In  addition  to 
ERCC2,  analysis  revealed  the  presence  of  two  previ- 
ously undescribed  genes  in  all  three  species.  One  of 
these  genes  is  a  new  member  of  the  kinesin  motor  pro- 
tein family.  These  proteins  play  a  wide  variety  of  roles 
in  the  cell,  including  movement  of  chromosomes  be- 
fore cell  division. 

• Complete sequencing of human genomic regions containing
two additional DNA repair genes. One of
these, XRCC3, maps to human chromosome 14 and
encodes a protein that may be required for chromosome
stability. Analysis of the genomic sequence
identified another kinesin motor protein gene physically
linked to XRCC3. The second human repair
gene, HHR23A, maps to 19p13.2. Sequence analysis
of 110,000 bases containing HHR23A identified six
other genes, five of which are new genes with similarity
to proteins from mouse, human, yeast, and
Caenorhabditis elegans.

• Complete sequencing of full-length cDNAs for three
new DNA repair genes (XRCC2, XRCC3, and
XRCC9) in collaboration with the LLNL DNA repair
group.

• Generation of a metric physical map of chromosome 19
spanning at least 95% of the chromosome.
This  unique  map  incorporates  a  metric  scale  to  esti- 
mate the  distance  between  genes  or  other  markers  of 
interest  to  the  genetics  community. 

• Assembly of nearly 45 million bases of EcoR I
restriction-mapped cosmid contigs for human chromosome 19
using a combination of fingerprinting and
cosmid walking. Small gaps in cosmid continuity have
been spanned by BAC, PAC, and P1 clones, which are
then  integrated  into  the  restriction  maps.  The  high 
depth  of  coverage  of  these  maps  (average  redundancy, 
4.3-fold)  permits  selection  of  a  minimum  overlapping 
set  of  clones  for  DNA  sequencing. 

• Placement of more than 400 genes, genetic markers,
and  other  loci  on  the  chromosome  19  cosmid  map. 
Also,  165  new  STSs  associated  with  premapped 
cosmid  contigs  were  generated  and  added  to  the 
physical  map. 

• Collaborations to identify the gene (COMP) responsible
for two allelic genetic diseases, pseudoachondroplasia
and multiple epiphyseal dysplasia, and the identification
of specific mutations causing each condition.

• Through sequence analysis of the 2A subfamily of the
human  cytochrome  P450  enzymes,  identification  of  a 
new  variant  that  exists  in  10%  to  20%  of  individuals 
and  results  in  reduced  ability  to  metabolize  nicotine 
and  the  antiblood-clotting  drug  Coumadin. 

• Location of a zinc finger gene that encodes a transcription
factor regulating blood-cell development
adjacent to telomere repeat sequences, possibly the
gene nearest one end of chromosome 19.

• Completion of the genomic and cDNA sequence of
the  gene  for  the  human  Rieske  Fe-S  protein  involved 
in  mitochondrial  respiration. 

• Expansion of the mouse-human comparative
genomics collaboration with ORNL to include study
of new groups of clustered transcription factors found
on human chromosome 19q and as syntenic homologs
on mouse chromosome 7.


• Numerous collaborations (in particular, with Washington
University and Merck) continuing to expand the
LLNL-based  IMAGE  Consortium,  an  effort  to  charac- 
terize the  transcribed  human  genome.  The  IMAGE 
clone  collection  is  now  the  largest  public  collection  of 
sequenced  cDNA  clones,  with  more  than  500,000  ar- 
rayed clones,  500,000  sequences  in  public  databases, 
and  10,000  mapped  cDNAs. 

• Development and deployment of a comprehensive
system  to  handle  sample  tracking  needs  of  production 
DNA  sequencing.  The  system  combines  databases  and 
graphical  interfaces  running  on  both  Mac  and  Sun 
platforms  and  scales  easily  to  handle  large-scale  pro- 
duction sequencing. 

• Expansion of the LLNL genome center's World Wide
Web site to include tables that link to each gene being
sequenced, to the quality scores and assembled bases
collected each night during the sequencing process,
and to the submitted GenBank sequence when a clone
is completed [http://bbrp.llnl.gov/test-bin/projqc.summary].

• Implementation of a new database to support sequencing
and mapping work on multiple chromosomes and
species.  Web-based  automated  tools  were  developed 
to  facilitate  construction  of  this  database,  the  loading 
of  over  100  million  bytes  of  chromosome  19  data 
from  the  existing  LLNL  database,  and  automated  gen- 
eration of  Web-based  input  interfaces. 

• Significant enhancement of the LLNL Genome
Graphical  Database  Browser  software  to  display  and 
link  information  obtained  at  a  subcosmid  resolution 
from  both  restriction  map  hybridization  and  sequence 
feature  data.  Features,  such  as  genes  linked  to  dis- 
eases, allow  tracking  to  fragments  as  small  as  500 
base  pairs  of  DNA. 

• Development of advanced microfabrication technologies
to produce electrophoresis microchannels in large
glass  substrates  for  use  in  DNA  sequencing. 

• Installation of a new filter-spotting robot that routinely
produces  6  x  6  x  384  filters.  A  16  x  16  x  384  pattern 
has  been  achieved. 

• Upgrade of the Lawrence Berkeley National Laboratory
colony picker using a second computer so that
imaging  and  picking  can  occur  simultaneously. 


Future  Plans 

Genomic  sequencing  currently  is  the  dominant  function  of 
Livermore's  Human  Genome  Center.  The  physical  map- 
ping effort  will  ensure  an  ample  supply  of  sequence-ready 
clones. For sequencing targets on chromosome 19, this
includes ensuring that the most stable clones (cosmids,
BACs, and PACs) are available for sequencing and that
regions with such known physical landmarks as STSs and
expressed sequence tags (ESTs) are annotated to facilitate
sequence  assembly  and  analysis.  The  following  targets  are 
emphasized  for  DNA  sequencing: 

•  Regions  of  high  gene  density,  including  regions  con- 
taining gene  families. 

•  Chromosome  19,  of  which  at  least  42  million  bases 
are  sequence  ready. 

• Selected BAC and PAC clones representing regions of
about 0.2 million to 1 million bases throughout the
human genome; clones would be selected based on
such high-priority biological targets as genes involved
in  DNA  repair,  replication,  recombination,  xenobiotic 
metabolism,  cell-cycle  checkpoints,  or  other  specific 
targets  of  interest. 

•  Selected  BAC  and  PAC  clones  from  mouse  regions 
syntenic  with  the  genes  indicated  above. 

•  Full-insert  cDNAs  corresponding  to  the  genomic 
DNA  being  sequenced. 

The  informatics  team  is  continuing  to  deploy  broader- 
based  supporting  databases  for  both  mapping  and  sequenc- 
ing. Where  appropriate,  Web-  and  Java-based  tools  are  be- 
ing developed  to  enable  biologists  to  interact  with  data. 
Recent  reorganization  within  this  group  enables  better  di- 
rect support  to  the  sequencing  group,  including  evaluating 
and  interfacing  sequence-assembly  algorithms  and  analysis 
tools,  data  and  process  tracking,  and  other  informatics 
functions  that  will  streamline  the  sequencing  process. 

The instrumentation effort has three major thrusts: (1) continued
development or implementation of laboratory automation
to support high-throughput sequencing; (2) development
of the next-generation DNA sequencer; and (3) development
of robotics to support high-density BAC clone
screening. The last two goals warrant further explanation.

The new DNA sequencer, being developed under a grant
from the National Institutes of Health with minor support
through the DOE genome center, is designed to run 384
lanes simultaneously with a low-viscosity sieving medium.
The entire system would be loaded automatically, run, and
set up for the next run at 3-hour intervals. If successful, it
should provide a 20- to 40-fold increase in throughput over
existing machines.
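
As a rough check on the figures above, the following sketch converts the
384-lane, 3-hour cycle into lanes per day and shows what baseline the
quoted 20- to 40-fold gain would imply. It assumes round-the-clock
operation and that the gain is measured in lanes per day; neither
assumption is stated in the report, and the function name is illustrative.

```python
# Figures quoted in the text.
LANES_PER_RUN = 384
RUN_INTERVAL_HOURS = 3


def lanes_per_day(lanes_per_run, interval_hours, hours_per_day=24):
    """Lanes processed per day, assuming back-to-back runs (an assumption)."""
    runs_per_day = hours_per_day // interval_hours
    return lanes_per_run * runs_per_day


daily_lanes = lanes_per_day(LANES_PER_RUN, RUN_INTERVAL_HOURS)

# Baseline implied by the 40-fold and 20-fold claims, in lanes per day.
# This is a back-calculation, not a figure from the report.
implied_baseline = (daily_lanes / 40, daily_lanes / 20)

print(daily_lanes, implied_baseline)
```

Under these assumptions the instrument would process 3,072 lanes per day.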

An LLNL-designed high-precision spotting robot, which
should allow a density of 98,304 spots in 96 cm², is now
operating. The goal of this effort is to create high-density
filters representing a 10x BAC coverage of both human
and mouse genomes (30,000 clones = 1x coverage). Thus
each filter would provide ~3x coverage, and eight such
filters would provide the desired coverage for both genomes.
The filters would be hybridized with amplicons
from individual or region-specific cDNAs and ESTs; given
the density of the BAC libraries, clones that hybridize
should represent a binned set of BACs for a region of interest.
These BACs could be the initial substrate for a BAC
sequencing strategy. Performing hybridizations in parallel
in mouse and human DNA facilitates the development of
the mouse map (with ORNL involvement), and sequencing
BACs from both species identifies evolutionarily conserved
and, perhaps, regulatory regions.
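
The coverage figures in the paragraph above can be checked with a few
lines of arithmetic. The sketch assumes one clone per spot and separate
filter sets for the human and mouse libraries; both are assumptions made
here for illustration, not statements from the report. The spot count
happens to equal the product of the 16 x 16 x 384 pattern mentioned in
the accomplishments list.

```python
import math

# Figures quoted in the text.
SPOTS_PER_FILTER = 98_304
FILTER_AREA_CM2 = 96
CLONES_PER_1X = 30_000
TARGET_COVERAGE = 10
GENOMES = ("human", "mouse")

# 98,304 is exactly 16 * 16 * 384.
assert 16 * 16 * 384 == SPOTS_PER_FILTER

spots_per_cm2 = SPOTS_PER_FILTER // FILTER_AREA_CM2

# Assumption: one clone per spot, so spots convert directly to coverage.
coverage_per_filter = SPOTS_PER_FILTER / CLONES_PER_1X

# Whole filters needed to reach the target coverage for one genome.
filters_per_genome = math.ceil(TARGET_COVERAGE / coverage_per_filter)

# Assumption: each genome gets its own set of filters.
total_filters = filters_per_genome * len(GENOMES)

print(spots_per_cm2, round(coverage_per_filter, 2),
      filters_per_genome, total_filters)
```

With these inputs each filter carries about 3.3x coverage, four filters
are needed per genome, and eight filters cover both genomes, which agrees
with the numbers given in the text.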

Information  generated  by  sequencing  human  and  mouse 
DNA  in  parallel  is  expected  to  expand  LLNL  efforts  in 
functional  genomics.  Comparative  sequence  data  will  be 
used  to  develop  a  high-resolution  synteny  map  of  con- 
served mouse-human  domains  and  incorporate  automated 
northern  expression  analysis  of  newly  identified  genes. 
Long  range,  the  center  hopes  to  take  advantage  of  a  variety 
of  forms  of  expression  analysis,  including  site-directed 
mutation  analysis  in  the  mouse. 

Summary 

The  Livermore  Human  Genome  Center  has  undergone  a 
dramatic  shift  in  emphasis  toward  commitment  to 
large-scale,  high-accuracy  sequencing  of  chromosome  19, 
other  chromosomes,  and  targeted  genomic  regions  in  the 
human  and  mouse.  The  center  also  is  committed  to  exploit- 
ing sequence  information  for  functional  genomics  studies 
and  for  other  programs,  both  in  house  and  collaboratively. 




Research  Narratives 
Los  Alamos  National  Laboratory  Center  for  Human  Genome  Studies 


Center  for  Human  Genome  Studies 
Los  Alamos  National  Laboratory 
P.O.  Box  1663 
Los  Alamos,  NM  87545 

Robert K. Moyzis, Director, 1989-97*

*Now at University of California, Irvine


Larry L. Deaven, Acting Director
505/667-3912, Fax: -2891
ldeaven@telomere.lanl.gov

Lynn  Clark,  Technical  Coordinator 
505/667-9376,  Fax:  -2891 
clark@telomere.lanl.gov 

http://www-ls.lanl.gov/masterhgp.html


Biological  research  was  initiated  at  Los  Alamos  National 
Laboratory  (LANL)  in  the  1940s,  when  the  laboratory  be- 
gan to  investigate  the  physiological  and  genetic  conse- 
quences of  radiation  exposure.  Eventual  establishment  of 
the  national  genetic  sequence  databank  called  GenBank, 
the  National  Flow  Cytometry  Resource,  numerous  related 
individual  research  projects,  and  fulfillment  of  a  key  role 
in  the  National  Laboratory  Gene  Library  Project  all  con- 
tributed to  LANL's  selection  as  the  site  for  the  Center  for 
Human  Genome  Studies  in  1988. 


Center Organization and Activities


The LANL genome center is organized into four broad areas
of research and support: Physical Mapping, DNA Sequencing,
Technology Development, and Biological Interfaces. Each
area consists of a variety of projects, and work is
distributed among five LANL Divisions (Life Sciences;
Theoretical; Computing, Information, and Communications;
Chemical Science and Technology; and Engineering Sciences
and Applications). Extensive interdisciplinary interactions
are encouraged.


Physical Mapping

The construction of chromosome- and region-specific
cosmid, bacterial artificial chromosome (BAC), and yeast
artificial chromosome (YAC) recombinant DNA libraries
is a primary focus of physical mapping activities at LANL.
Specific work includes the construction of high-resolution
maps of human chromosomes 5 and 16 and associated
informatics and gene discovery tasks.

Accomplishments

• Completion of an integrated physical map of human
chromosome 16 consisting of both a low-resolution
YAC contig map and a high-resolution cosmid contig
map. With sequence tagged site (STS) markers provided
on average every 125,000 bases, the YAC-STS
map provides almost-complete coverage of the
chromosome's euchromatic arms. All available loci
continue to be incorporated into the map.

• Construction of a low-resolution STS map of human
chromosome 5 consisting of 517 STS markers regionally
assigned by somatic-cell hybrid approaches.
Around 95% mega-YAC-STS coverage (50 million
bases) of 5p has been achieved. Additionally, about
40 million bases of 5q mega-YAC-STS coverage have
been obtained collaboratively.

• Refinement of BAC cloning procedures for future
production of chromosome-specific libraries. Successful
partial digestion and cloning of microgram quantities
of chromosomal DNA embedded in agarose plugs.
Efforts continue to increase the average insert size to
about 100,000 bases.


DNA Sequencing

DNA sequencing at the LANL center focuses on low-pass
sample sequencing (SASE) of large genomic regions.
SASE data is deposited in publicly available databases to
allow for wide distribution. Finished sequencing is prioritized
from initial SASE analysis and pursued by parallel
primer walking. Informatics development includes data
tracking, gene-discovery integration with the Sequence
Comparison ANalysis (SCAN) program, and functional
genomics interaction.

Accomplishments 

• SASE sequencing of 1.5 million bases from the p13
region of human chromosome 16.

•  Discovery  of  more  than  100  genes  in  SASE  se- 
quences. 

•  Generation  of  finished  sequence  for  a  240,000-base 
telomeric  region  of  human  chromosome  7q.  From  ini- 
tial sequences  generated  by  SASE,  oligonucleotides 
were  synthesized  and  used  for  primer  walking  directly 
from  cosmids  comprising  the  contig  map.  Complete 
sequencing  was  performed  to  determine  what  genes,  if 
any,  are  near  the  7q  terminus.  This  intriguing  region 
lacks  significant  blocks  of  subtelomeric  repeat  DNA 
typically present near eukaryotic telomeres.

• Complete single-pass sequencing of 2018 exon clones
generated  from  LANL's  flow-sorted  human  chromo- 
some 16  cosmid  library.  About  950  discrete  sequences 
were  identified  by  sequence  analysis.  Nearly  800  ap- 
pear to  represent  expressed  sequences  from  chromo- 
some 16. 

•  Development  of  Sequence  Viewer  to  display  ABI  se- 
quences with  trace  data  on  any  computer  having  an 
Internet  connection  and  a  Netscape  World  Wide  Web 
browser. 

•  Sequencing  and  analysis  of  a  novel  pericentromeric 
duplication of a gene-rich cluster between 16p11.1 and
Xq28  (in  collaboration  with  Baylor  College  of  Medi- 
cine). 

Technology  Development 

Technology  development  encompasses  a  variety  of  activi- 
ties, both  short  and  long  term,  including  novel  vectors  for 
library  construction  and  physical  mapping;  automation  and 
robotics  tools  for  physical  mapping  and  sequencing;  novel 
approaches  to  DNA  sequencing  involving  single-molecule 
detection;  and  novel  approaches  to  informatics  tools  for 
gene  identification. 

Accomplishments 

•  Development  of  SCAN  program  for  large-scale  se- 
quence analysis  and  annotation,  including  a  translator 
converting  SCAN  data  to  GIO  format  for  submission  to 
Genome  Sequence  DataBase. 

• Application of flow-cytometric approach to DNA sizing
of P1 artificial chromosome (PAC) clones. Less
than  one  picogram  of  linear  or  supercoiled  DNA  is  ana- 
lyzed in  under  3  minutes.  Sizing  range  has  been  ex- 
tended down  to  287  base  pairs.  Efforts  continue  to  ex- 
tend the  upper  limit  beyond  167,000  bases. 

• Characterization of the detection of single, fluorescently
tagged nucleotides cleaved from multiple DNA
fragments suspended in the flow stream of a flow cytometer.
The cleavage rate for Exo III at 37°C was
measured to be about 5 base pairs per second per M13
DNA  fragment.  To  achieve  a  single-color  sequencing 
demonstration,  either  the  background  burst  rate  (cur- 
rently about  5  bursts  per  second)  must  be  reduced  or 
the  exonuclease  cleavage  rate  must  be  increased  sig- 
nificantly. Techniques  to  achieve  both  are  being  ex- 
plored. 

•  Construction  of  a  simple  and  compact  apparatus,  based 
on  a  diode-pumped  Nd:YAG  laser,  for  routine  DNA 
fragment  sizing. 

• Development of a new approach to detect coding sequences
in DNA. This complete spectral analysis of
coding and noncoding sequences is as sensitive in its
first implementations as the best existing techniques.

•  Use  of  phylogenetic  relationships  to  generate  new 
profiles  of  amino  acid  usage  in  conserved  domains. 
The  profiles  are  particularly  useful  for  classification 
of  distantly  related  sequences. 

Biological  Interfaces 

The  Biological  Interfaces  effort  targets  genes  and  chromo- 
some regions  associated  with  DNA  damage  and  repair, 
mitotic  stability,  and  chromosome  structure  and  function 
as  primary  subjects  for  physical  mapping  and  sequencing. 
Specific disease-associated genes on human chromosome 5
(e.g., Cri-du-Chat syndrome) and on 16 (e.g.,
Batten's disease and Fanconi anemia) are the subjects of
collaborative  biological  projects. 

Accomplishments 

• Identification of two human 7q exons having 99% homology
to the cDNA of a known human gene, vasoactive
intestinal peptide receptor 2A. Preliminary data
suggest that the VIPR2A gene is expressed.

• Identification of numerous expressed sequence tags
(ESTs) localized to the 7q region. Since three of the
ESTs contain at least two regions with high confidence
of homology (~90%), genes in addition to
VIPR2A may exist in the terminal region of 7q.

•  Generation  of  high-resolution  cosmid  coverage  on 
human  chromosome  5p  for  the  larynx  and  critical  re- 
gions identified  with  Cri-du-Chat  syndrome,  the  most 
common  human  terminal-deletion  syndrome  (in  col- 
laboration with  Thomas  Jefferson  University). 

• Refinement of the Wolf-Hirschhorn syndrome (WHS)
critical  region  on  human  chromosome  4p.  Using  the 
SCAN  program  to  identify  genes  likely  to  contribute 
to  WHS,  the  project  serves  as  a  model  for  defining  the 
interaction  between  genomic  sequencing  and  clinical 
research. 

• Collaborative construction of contigs for human chromosome 16,
including 1.05 million bases in cosmids
through the familial Mediterranean fever (FMF) gene
region (with members of the FMF Consortium) and
700,000 bases in P1 clones encompassing the polycystic
kidney disease gene (with Integrated Genetics,
Inc.).

•  Collaborative  identification  and  determination  of  the 
complete  genomic  structure  of  the  Batten's  disease 
gene  (with  members  of  the  BDG  Consortium),  the 
gamma  subunit  of  the  human  amiloride-sensitive  epi- 
thelial channel  (Liddle's  syndrome,  with  University  of 
Iowa),  and  the  polycystic  kidney  disease  gene  (with 
Integrated Genetics).

• Participation in an international collaborative research
consortium that successfully identified the gene responsible
for Fanconi anemia type A.


Patents, Licenses, and CRADAs

• Rhett L. Affleck, James N. Demas, Peter M. Goodwin,
Jay A. Schecker, Ming Wu, and Richard A. Keller,
"Reduction of Diffusional Defocusing in Hydrodynamically
Focused Flows by Complexing with a High
Molecular Weight Adduct," United States Patent, filed
December 1996.

• R.L. Affleck, W.P. Ambrose, J.D. Demas, P.M.
Goodwin, M.E. Johnson, R.A. Keller, J.T. Petty, J.A.
Schecker, and M. Wu, "Photobleaching to Reduce or
Eliminate Luminescent Impurities for Ultrasensitive
Luminescence Analysis," United States Patent, S-87,208,
accepted September 1997.

• J.H. Jett, M.L. Hammond, R.A. Keller, B.L. Marrone,
and J.C. Martin, "DNA Fragment Sizing and Sorting
by Laser-Induced Fluorescence," United States Patent,
S.N. 75,001, allowed May 1996.

• James H. Jett, "Method for Rapid Base Sequencing in
DNA and RNA with Three Base Labeling," in preparation.

• Development license and exclusive license to LANL's
DNA sizing patent obtained by Molecular Technology,
Inc., for commercialization of single-molecule detection
capability to DNA sizing.


Future  Plans 

LANL  has  joined  a  collaboration  with  California  Institute 
of  Technology  and  The  Institute  for  Genomic  Research  to 
construct a BAC map of the p arm of human chromosome 16
and to complete the sequence of a 20-million-base
region of this map.

In  its  evolving  role  as  part  of  the  new  DOE  Joint  Genome 
Institute,  LANL  will  continue  scaleup  activities  focused  on 
high-throughput  DNA  sequencing.  Initial  targets  include 
genes  and  DNA  regions  associated  with  chromosome 
structure  and  function,  syntenic  break-points,  and  relevant 
disease-gene  loci. 

A  joint  DNA  sequencing  center  was  established  recently 
by  LANL  at  the  University  of  New  Mexico.  This  facility  is 
responsible  for  determining  the  DNA  sequence  of  clones 
constructed  at  LANL,  then  returning  the  data  to  LANL  for 
analysis and archiving.


Lawrence Berkeley National Laboratory Human Genome Center


Human  Genome  Center 

Lawrence  Berkeley  National  Laboratory 

1  Cyclotron  Road 

Berkeley,  CA  94720 

Michael  Palazzolo,*  Director,  1996-97 


Contact: Mohandas Narla
510/486-7029, Fax: -6746
mohandas_narla@macmail.lbl.gov

Joyce Pfeiffer, Administrative Assistant

http://www-hgc.lbl.gov/GenomeHome.html


*Now at Amgen, Inc.


Since  1937  the  Ernest  Orlando  Lawrence  Berkeley  Na- 
tional Laboratory  (LBNL)  has  been  a  major  contributor  to 
knowledge  about  human  health  effects  resulting  from  en- 
ergy production  and  use.  That  was  the  year  John  Lawrence 
went  to  Berkeley  to  use  his  brother  Ernest's  cyclotrons  to 
launch  the  application  of  radioactive  isotopes  in  biological 
and  medical  research.  Fifty  years  later,  Berkeley  Lab's  Hu- 
man Genome  Center  was  established. 

Now,  after  another  decade,  an  expansion  of  biological  re- 
search relevant  to  Human  Genome  Project  goals  is  being 
carried  out  within  the  Life  Sciences  Division,  with  support 
from  the  Information  and  Computing  Sciences  and  Engi- 
neering divisions.  Individuals  in  these  research  projects  are 
making  important  new  contributions  to  the  key  fields  of 
molecular,  cellular,  and  structural  biology;  physical  chem- 
istry; data  management;  and  scientific  instrumentation. 
Additionally,  industry  involvement  in  this  growing  venture 
is  stimulated  by  Berkeley  Lab's  location  in  the  San  Fran- 
cisco Bay  area,  home  to  the  largest  congregation  of  bio- 
technology research  facilities  in  the  world. 

In  July  1997  the  Berkeley  genome  center  became  part  of 
the  Joint  Genome  Institute. 


Sequencing 

Large-scale  genomic  sequencing  has  been  a  central,  ongo- 
ing activity  at  Berkeley  Lab  since  1991.  It  has  been  funded 
jointly  by  DOE  (for  human  genome  production  sequencing 
and  technology  development)  and  the  NIH  National  Hu- 
man Genome  Research  Institute  [for  sequencing  the 
Drosophila  melanogaster  model  system,  which  is  carried 
out in partnership with the University of California, Berkeley
(UCB)]. The human genome sequencing area at Berkeley
Lab consists of five groups: Bioinstrumentation, Automation,
Informatics, Biology, and Development. Complementing
these activities is a group in Life Sciences Division
devoted to functional genomics, including the
transgenics program.

The directed DNA sequencing strategy at Berkeley Lab
was designed and implemented to increase the efficiency
of genomic sequencing. A key element of the directed approach
is maintaining information about the relative positions
of potential sequencing templates throughout the entire
sequencing process. Thus, intelligent choices can be
made about which templates to sequence, and the number
of selected templates can be kept to a minimum. More important,
knowledge of the interrelationship of sequencing
runs guides the assembly process, making it more resistant
to difficulties imposed by repeated sequences. As of July 3,
1997, Berkeley Lab had generated 4.4 megabases of human
sequence and, in collaboration with UCB, had tallied
7.6 megabases of Drosophila sequence.
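
The idea of keeping the number of selected templates to a minimum when
their relative positions are known can be illustrated with a standard
greedy interval cover. This is a minimal sketch under stated assumptions,
not the software used at Berkeley Lab: templates are modeled as
hypothetical (start, end) intervals on the clone, and the function name
and example coordinates are invented for illustration.

```python
def minimal_tiling(templates, target_start, target_end):
    """Greedy interval cover.

    Returns a smallest subset of (start, end) templates whose union
    covers the half-open range [target_start, target_end).
    Raises ValueError if the templates leave a gap.
    """
    ordered = sorted(templates)
    chosen = []
    covered_to = target_start
    i = 0
    while covered_to < target_end:
        best = None
        # Among templates that start at or before the covered point,
        # keep the one that reaches farthest to the right.
        while i < len(ordered) and ordered[i][0] <= covered_to:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] <= covered_to:
            raise ValueError(f"gap in template coverage at {covered_to}")
        chosen.append(best)
        covered_to = best[1]
    return chosen


# Hypothetical mapped templates on a 1,200-base region.
mapped = [(0, 400), (100, 500), (350, 800),
          (450, 900), (700, 1200), (850, 1200)]
print(minimal_tiling(mapped, 0, 1200))
```

In this example three of the six mapped templates are enough to span the
region. Without positional information, all six would have to be
sequenced before their overlaps became apparent.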

Instrumentation  and  Automation 

The  instrumentation  and  automation  program  encompasses 
the  design  and  fabrication  of  custom  apparatus  to  facilitate 
experiments,  the  programming  of  laboratory  robots  to  auto- 
mate repetitive  procedures,  and  the  development  of  (1)  im- 
proved hardware  to  extend  the  applicability  range  of  exist- 
ing commercial  robots  and  (2)  an  integrated  operating  sys- 
tem to  control  and  monitor  experiments.  Although  some 
discrete  instrumentation  modules  used  in  the  integrated  pro- 
tocols are  obtained  commercially,  LBNL  designs  its  own 
custom instruments when existing capabilities are inadequate.
The  instrumentation  modules  are  then  integrated  into  a 
large  system  to  facilitate  large-scale  production  sequencing. 
In  addition,  a  significant  effort  is  devoted  to  improving 
fluorescence-assay  methods,  including  DNA  sequence 
analysis  and  mass  spectrometry  for  molecular  sizing. 

Recent advances in the instrumentation group include the DNA
Prep machine and Prep Track. These instruments are designed
to automate completely the highly repetitive and labor-intensive
DNA-preparation procedure to provide higher
daily throughput and DNA of consistent quality for sequencing
(see Web pages: http://hgighub.lbl.gov/esd/DNAPrep/TitlePage.html
and http://hgighub.lbl.gov/esd/repTrackWebpage/preptrack.htm).

Berkeley  Lab's  near-term  needs  are  for  960  samples  per  day 
of  DNA  extracted  from  overnight  bacteria  growths.  The 
DNA protocol is a modified boil prep prepared in a 96-well
format. Overnight bacteria growths are lysed, and samples
are separated from cell debris by centrifugation. The DNA is
recovered by ethanol precipitation.

Informatics 

The  informatics  group  is  focused  on  hardware  and  software 
support  and  system  administration,  software  development 
for  end  sequencing,  transposon  mapping  and  sequence  tem- 
plate selection,  data-flow  automation,  gene  finding,  and  se- 
quence analysis.  Data-flow  automation  is  the  main  empha- 
sis. Six  key  steps  have  been  identified  in  this  process,  and 
software  is  being  written  and  tested  to  automate  all  six.  The 
first  step  involves  controlling  gel  quality,  trimming  vector 
sequence,  and  storing  the  sequences  in  a  database.  A  pro- 
gram module  called  Move-Track-Trim,  which  is  now  used 
in  production,  was  written  to  handle  these  steps.  The  second 
through fourth steps in this process involve assembling, editing,
and reconstructing P1 clones of 80,000 base pairs
from  400-base  traces.  The  fifth  step  is  sequence  annotation, 
and  the  sixth  is  data  submission. 
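
The scale of the assembly steps follows from the two sizes quoted above.
The sketch below estimates how many 400-base traces are needed to
reconstruct an 80,000-base-pair P1 clone. The redundancy factor is a
hypothetical parameter introduced here for illustration; the report does
not state what redundancy the center used.

```python
import math

# Sizes quoted in the text.
CLONE_LENGTH_BP = 80_000
TRACE_LENGTH_BP = 400


def traces_needed(clone_bp, trace_bp, redundancy=1.0):
    """Traces whose combined length equals `redundancy` times the clone length."""
    return math.ceil(clone_bp * redundancy / trace_bp)


# End-to-end tiling with no overlap at all (a lower bound).
single_pass = traces_needed(CLONE_LENGTH_BP, TRACE_LENGTH_BP)

# Hypothetical 3-fold redundancy, chosen only as an example value.
threefold = traces_needed(CLONE_LENGTH_BP, TRACE_LENGTH_BP, redundancy=3.0)

print(single_pass, threefold)
```

Even the no-overlap lower bound is 200 traces per clone, which is why
automating the data flow is the group's main emphasis.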

Annotation  can  greatly  enhance  the  biological  value  of  these 
sequences.  Useful  annotations  include  homologies  to  known 
genes, possible gene locations, and gene signals such as promoters.
LBNL is developing a workbench for automatic sequence
annotation and annotation viewing and editing. The
goal  is  to  run  a  series  of  sequence-analysis  tools  and  display 
the  results  to  compare  the  various  predictions.  Researchers 
then  will  be  able  to  examine  all  the  annotations  (for  ex- 
ample, genes  predicted  by  various  gene-finding  methods) 
and  select  the  ones  that  look  best. 

Nomi  Harris  developed  Genotator,  an  annotation  workbench 
consisting  of  a  stand-alone  annotation  browser  and  several 
sequence-analysis  functions.  The  back  end  runs  several  gene 
finders,  homology  searches  (using  BLAST),  and  signal 
searches  and  saves  the  results  in  ".ace"  format.  Genotator 
thus  automates  the  tedious  process  of  operating  a  dozen  dif- 
ferent sequence-analysis  programs  with  many  different  in- 
put and  output  formats.  Genotator  can  function  via  com- 
mand-line arguments  or  with  the  graphical  user  interface 
(http://www-hgc.lbl.gov/inf/annotation.html). 


Progress  to  Date 

Chromosome  5 

Over  the  last  year,  the  center  has  focused  its  production  ge- 
nomic sequencing  on  the  distal  40  megabases  of  the  human 
chromosome  5  long  arm.  This  region  was  chosen  because  it 
contains  a  cluster  of  growth  factor  and  receptor  genes  and  is 
likely  to  yield  new  and  functionally  related  genes  through 
long-range  sequence  analysis.  Results  to  date  include: 


•  40-megabase  nonchimeric  map  containing  82  yeast 
artificial chromosomes (YACs) in the chromosome 5
distal  long  arm. 

•  20-megabase  contig  map  in  the  region  of  5q23-q33 
that  contains  198  Pis,  60  PI  artificial  chromosomes, 
and  495  bacterial  artificial  chromosomes  (BACs) 
linked  by  563  sequenced  tagged  sites  (STSs)  to  form 
contigs. 

•  20-megabase  bins  containing  370  BACs  in  74  bins  in 
the  region  of  5q33-q35. 

Chromosome  21 

An early project in the study of Down syndrome (DS),
which is characterized by chromosome 21 trisomy, constructed
a high-resolution clone map in the chromosome 21
DS region to be used as a pilot study in generating a contiguous
gene map for all of chromosome 21. This project
has integrated P1 mapping efforts with transgenic studies
in the Life Sciences Division. P1 maps provide a suitable
form of genomic DNA for isolating and mapping cDNA.

• 186 clones isolated in the major DS region of chromosome
21 comprising about 3 megabases of genomic
DNA extending from D21S17 to ETS2. Through
cross-hybridization, overlapping P1s were identified,
as well as gaps between two P1 contigs, and
transgenic mice were created from P1 clones in the
DS region for use in phenotypic studies.


Transgenic Mice

One of the approaches for determining the biological function
of newly identified genes uses YAC transgenic mice.
Human sequence harbored by YACs in transgenic mice has
been shown to be correctly regulated both temporally and
spatially. A set of nonchimeric overlapping YACs identified
from the 5q31 region has been used to create transgenic
mice. This set of transgenic mice, which together harbor
1.5 megabases of human sequence, will be used to assess
the expression pattern and potential function of putative
genes discovered in the 5q31 region. Additional mapping
and sequencing are under way in a region of human chromosome
20 amplified in certain breast tumor cell lines.

Resource  for  Molecular  Cytogenetics 

Divining landmarks for human disease amid the enormous
plain of the human genetic map is the mission of an ambitious
partnership among the Berkeley Lab; University of
California, San Francisco; and a diagnostics company. The
collaborative Resource for Molecular Cytogenetics is
charting a course toward important sites of biological
interest on the 23 pairs of human chromosomes
(http://rmc-www.lbl.gov).

The Resource employs the many tools of molecular cytogenetics.
The most basic of these tools, and the cornerstone
of the Resource's portfolio of proprietary technology,
is a method generally known as "chromosome painting,"
which uses a technique referred to as fluorescence in
situ hybridization, or FISH. This technology was invented
by LBNL Resource leaders Joe Gray and Dan Pinkel.

A technology to emerge recently from the Resource is
known as "Quantitative DNA Fiber Mapping (QDFM)."
High-resolution human genome maps in a form suitable
for DNA sequencing traditionally have been constructed
by various methods of fingerprinting, hybridization, and
identification of overlapping STSs. However, these techniques
do not readily yield information about sequence
orientation, the extent of overlap of these elements, or the
size of gaps in the map. Ulli Weier of the Resource developed
the QDFM method of physical map assembly that
enables the mapping of cloned DNA directly onto linear,
fully extended DNA molecules. QDFM allows unambiguous
assembly of critical elements leading to high-resolution
physical maps. This task now can be accomplished in less
than 2 days, as compared with weeks by conventional
methods. QDFM also enables detection and characterization
of gaps in existing physical maps, a crucial step toward
completing a definitive human genome map.

The Human Genome Project soon will need to increase rapidly
the scale at which human DNA is analyzed. The ultimate
goal is to determine the order of the 3 billion bases
that encode all heritable information. During the 20 years
since effective methods were introduced to carry out DNA
sequencing by biochemical analysis of recombinant-DNA
molecules, these techniques have improved dramatically. In
the late 1970s, segments of DNA spanning a few thousand
bases challenged the capacity of world-class sequencing
laboratories. Now, a few million base pairs per year represent
state-of-the-art output for a single sequencing center.

However, the Human Genome Project is directed toward
completing the human sequence in 5 to 10 years, so the data
must be acquired with technology available now. This goal,
while clearly feasible, poses substantial organizational and
technical challenges. Organizationally, genome centers
must begin building data-production units capable of sustained,
cost-effective operation. Technically, many incremental
refinements of current technology must be introduced,
particularly those that remove impediments to increasing
the scale of DNA sequencing. The University of
Washington (UW) Genome Center is active in both areas.

Production  Sequencing 

Both to gain experience in the production of high-quality,
low-cost DNA sequence and to generate data of immediate
biological interest, the center is sequencing several regions
of human and mouse DNA at a current throughput of 2 million
bases per year. This "production sequencing" has three
major targets: the human leukocyte antigen (HLA) locus on
human chromosome 6, the mouse locus encoding the alpha
subunit of T-cell receptors, and an "anonymous" region of
human chromosome 7.

The  HLA  locus  encodes  genes  that  must  be  closely  matched 
between  organ  donors  and  organ  recipients.  This  sequence 
data  is  expected  to  lead  to  long-term  improvements  in  the 
ability  to  achieve  good  matches  between  unrelated  organ 
donors  and  recipients. 

The mouse locus that encodes components of the T-cell-receptor
family is of interest for several reasons. The locus
specifies a set of proteins that play a critical role in
cell-mediated immune responses. It provides sequence data
that will help in the design of new experimental approaches
to the study of immunity in mice, one of the most important
experimental animals for immunological research. In
addition, the locus will provide one of the first large blocks
of DNA sequence for which both human and mouse versions
are known.

Human-mouse sequence comparisons provide a powerful
means of identifying the most important biological features
of DNA sequence because these features are often highly
conserved, even between such biologically different organisms
as human and mouse. Finally, sequencing an "anonymous"
region of human chromosome 7, a region about
which little was known previously, provides experience in
carrying out large-scale sequencing under the conditions
that will prevail throughout most of the Human Genome
Project.
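
The idea that conserved stretches stand out in a human-mouse comparison can be shown with a small sketch. It assumes the two sequences are already aligned without gaps and are of equal length; the sequences, the window size, and the `window_identity` helper are invented for illustration and do not represent the center's software.

```python
def window_identity(seq_a, seq_b, window):
    """Fraction of identical bases in consecutive, non-overlapping windows.

    Assumes the sequences are pre-aligned, gap-free, and equal in length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = [x == y for x, y in zip(seq_a, seq_b)]
    return [
        sum(matches[i:i + window]) / window
        for i in range(0, len(matches) - window + 1, window)
    ]


# Toy sequences: the first 8 bases are identical, the last 8 differ.
human = "ACGTACGTAAAACCCC"
mouse = "ACGTACGTTTTTGGGG"

scores = window_identity(human, mouse, window=8)
conserved = [i for i, s in enumerate(scores) if s >= 0.75]
print(scores, conserved)
```

Windows that score well above the background level are candidates for functionally important features; the 0.75 cutoff here is arbitrary.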

Technology  for  Large-Scale  Sequencing 

In addition to these pilot projects, the UW Genome Center
is developing incremental improvements in current sequencing
technology. A particular focus is on enhanced
computer software to process raw data acquired with automated
laboratory instruments that are used in DNA mapping
and sequencing. Advanced instrumentation is commercially
available for determining DNA sequence via the
"four-color-fluorescence method," and this instrumentation
is expected to carry the main experimental load of the Human
Genome Project. Raw data produced by these instruments,
however, require extensive processing before they
are ready for biological analysis.

Large-scale sequencing involves a "divide-and-conquer"
strategy in which the huge DNA molecules present in human
cells are broken into smaller pieces that can be propagated
by recombinant-DNA methods. Individual analyses
ultimately are carried out on segments of less than 1000
bases. Many such analyses, each of which still contains numerous
errors, must be melded together to obtain finished
sequence. During the melding, errors in individual analyses
must be recognized and corrected. In typical large-scale sequencing
projects, the results of thousands of analyses are
melded to produce highly accurate sequence (less than one
error in 10,000 bases) that is continuous in blocks of
100,000 or more bases. The UW Genome Center is playing
a major role in developing software that allows this process
to be carried out automatically with little need for expert
intervention. Software developed in the UW center is used
in more than 50 sequencing laboratories around the world,
including most of the large-scale sequencing centers producing
data for the Human Genome Project.
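
How overlapping analyses can correct one another is easiest to see in a toy example. The sketch below is a plain majority vote over reads that are assumed to be already aligned and padded with "-" where they give no coverage. It is not the UW center's software, which must also weigh base quality and compute the alignment itself; the reads and the `consensus` function are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over reads padded to equal length.

    '-' marks positions a read does not cover; those are ignored.
    """
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if not column:
            result.append("N")  # no read covers this position
            continue
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


reads = [
    "ACGTTGCA----",
    "--GTAGCATG--",  # contains one error at position 4 (A instead of T)
    "----TGCATGAC",
    "ACGTTG------",
]
print(consensus(reads))
```

The single miscalled base in the second read is outvoted by the three reads that agree. At the stated accuracy of less than one error in 10,000 bases, a finished 100,000-base block would contain fewer than 10 errors.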

High-Resolution Physical Mapping

The UW Genome Center also is developing improved software
that addresses a higher-level problem in large-scale
sequencing. The starting point for large-scale sequencing
typically is a recombinant-DNA molecule that allows
propagation of a particular human genomic segment spanning
50,000 to 200,000 bases. Much effort during the last
decade has gone into the physical mapping of such molecules,
a process that allows huge regions of chromosomes
to be defined in terms of sets of overlapping recombinant-DNA
molecules whose precise positions along the chromosome
are known. However, the precision required for
knowing relationships of recombinant-DNA molecules
derived from neighboring chromosomal portions increases
as the Human Genome Project shifts its emphasis from
mapping to sequencing.

High-resolution maps both guide the orderly sequencing of
chromosomes and play a critical role in quality control.
Only by mapping recombinant-DNA molecules at high
resolution can subtle defects in particular molecules be
recognized. Such defective human DNA sources, which
are not faithful replicas of the human genome, must be
weeded out before sequencing can begin. The UW Genome
Center has a major program in high-resolution physical
mapping which, like the work on sequencing itself, uses
advanced computing tools. The center is producing maps
of regions targeted for sequencing on a just-in-time basis.
These highly detailed maps are proving extremely valuable
in facilitating the production of high-quality sequence.

Ultimate  Goal 

Although many challenges currently posed by the Human
Genome Project are highly technical, the ultimate goal is
biological. The project will deliver immense amounts of
high-quality, continuous DNA sequence into publicly accessible
databases. These data will be annotated so that
biologists who use them will know the most likely positions
of genes and have convenient access to the best
available clues about the probable function of these genes.
The better the technical solutions to current challenges, the
better the center will be able to serve future users of the
human genome sequence.

The release of Version 6 of the Genome Database (GDB)
in January 1996 signaled a major change for both the scientific
community and GDB staff. GDB 6.0 introduced a
number of significant improvements over previous versions
of GDB, most notably a revised data representation
for genes and genomic maps and a new curatorial model
for the database. These new features, along with a remodeled
database structure and new schema and user interface,
provide a resource with the potential to integrate all
scientific information currently available on human
genomics. GDB rapidly is becoming the international
biomedical research community's central source for information
about genomic structure, content, diversity,
and evolution.

A  New  Data  Model 

Inherent in the underlying organization of information in
GDB is an improved model for genes, maps, and other
classes of data. In particular, genomic segments (any
named region of the genome) and maps are being expanded
regularly. New segment types have been added to
support the integration of mapping and sequencing data
(for example, gene elements and repeats) and the construction
of comparative maps (syntenic regions). New
map types include comparative maps for representing
conserved syntenies between species and comprehensive
maps that combine data from all the various submitted
maps within GDB to provide a single integrated view of
the genome. Experimental observations such as order,
size, distance, and chimerism are also available.

Through the World Wide Web, GDB links its stored data
with many other biological resources on the Internet.
GDB's External Link category is a growing collection of
cross-references established between GDB entities and
related information in other databases. By providing a
place for these cross-references, GDB can serve as a central
point of inquiry into technical data regarding human
genomics.


Direct  Community  Data  Submission 
and  Curation 

Two methods for data submission are in use. For individuals
submitting small amounts of data, interactive editing
of the database through the Web became available in
April 1996, and the process has undergone several simplifications
since that time. This continues to be an area of
development for GDB because all editing must take place
at the Baltimore site, and Internet connections from outside
North America may be too slow for interactive editing
to be practical. Until these difficulties are resolved,
GDB encourages scientists with limited connectivity to
Baltimore to submit their data via more traditional means
(e-mail, fax, mail, phone) or to prepare electronic submissions
for entry by the data group on site.

For centers submitting large quantities of data, GDB developed
an electronic data submission (EDS) tool, which
provides the means to specify login password validation
and commands for inserting and updating data in GDB.
The EDS syntax includes a mechanism for relating a
center's local naming conventions to GDB objects. Data
submitted to GDB may be stored privately for up to
6 months before it automatically becomes public. The
database is programmed to enforce this Human Genome
Project policy. Detailed specifications of GDB's EDS syntax
and other submission instructions are available (EDS
prototype, http://www.gdb.org/eds).
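
The 6-month private-hold rule lends itself to a short worked example. The sketch below is only an illustration of the date arithmetic; the function names are invented, and how GDB itself enforces the policy inside the database is not described in the text.

```python
import calendar
from datetime import date

HOLD_MONTHS = 6  # from the text: private for up to 6 months


def release_date(submitted: date, months: int = HOLD_MONTHS) -> date:
    """Date on which a privately held submission becomes public."""
    month_index = submitted.month - 1 + months
    year = submitted.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day for short months (31 Aug + 6 months -> end of Feb).
    day = min(submitted.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_public(submitted: date, today: date) -> bool:
    """True once the hold period has elapsed."""
    return today >= release_date(submitted)


print(release_date(date(1996, 1, 15)))
print(is_public(date(1996, 1, 15), date(1996, 7, 14)))
```

A submitter could of course release data earlier; the sketch models only the automatic upper limit.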

Since  the  EDS  system  was  implemented,  GDB  has  put 
forth  an  aggressive  effort  to  increase  the  amount  of  data 
stored  in  the  database.  Consequently,  the  database  has 
grown  tremendously.  During  1996  it  grew  from  1.8  to 
6.7  gigabytes. 

To provide accountability regarding data quality, the shift
to community curation introduced the idea that individuals
and laboratories own the data they submit to GDB and
that other researchers cannot modify it. However, others
should be able to add information and comments, so an
additional feature is the community's ability to conduct
electronic online public discussions by annotating the
database submissions of fellow researchers. GDB is the
first database of its kind to offer this feature, and the
number of third-party annotations is increasing in the
form of editorial commentary, links to literature citations,
and links to other databases external to GDB. These links
are an important part of the curatorial process because
they make other data collections available to GDB users
in an appropriate context.

Improved  Map  Representation 
and  Querying 

Accompanying the release of GDB 6.0, the program
Mapview creates graphical displays of maps. Mapview
was developed at GDB to display a number of map types
(cytogenetic, radiation hybrid, contig, and linkage) using
common graphical conventions found in the literature.
Mapview is designed to stand alone or to be used in conjunction
with a Web browser such as Netscape, thereby
creating an interactive graphical display system. When
used with Netscape, Mapview allows the user to retrieve
details about any displayed map object.

Maps are accessed through the query form for genomic
segment and its subclasses via a special program that allows
the user to select whole maps or slices of maps from
specific regions of interest and to query by map type. The
ability to browse maps stored in GDB or download them
in the background was also incorporated into GDB 6.0.

GDB stores many maps of each chromosome, generated
by a variety of mapping methods. Users who are interested
in a region, such as the neighborhood of a gene or
marker, will be able to see all maps that have data in that
region, whether or not they contain the desired marker. To
support database querying by region of interest, integrated
maps have been developed that combine data from
all the maps for each chromosome. These are called Comprehensive
Maps.

Queries for all loci in a region of interest are processed
against the comprehensive maps, thereby searching all
relevant maps. Comprehensive maps are also useful for
display purposes because they organize the content of a
region by class of locus (e.g., gene, marker, clone) rather
than by data source. This approach yields a much less
complex presentation than an alignment of numerous primary
maps. Because such information as detailed orders,
order discrepancies between maps, and nonlinear metric
relations between maps is not always captured in the
comprehensive maps, GDB continues to provide access to
aligned displays of primary maps.


A  Variety  of  Searching  Strategies 

Recognizing the eclectic user community's need to search
data and formulate queries, GDB offers a spectrum of
simple to complex search strategies. In addition, direct
programming access is available using either GDB's object
query language to the Object Broker software layer or
Structured Query Language (SQL) to the underlying Sybase
relational database.

Querying  by  Object  Directly  from  GDB's 
Home  Page 

The simplest methods search for objects according to
known GDB accession numbers; sequence database
accession numbers; specified names, including wildcard
symbols that will automatically match synonyms and primary
names; and keywords contained anywhere in the
text.

Querying  by  Region  of  Interest 

A region of interest can be specified using a pair of flanking
markers, which can be cytogenetic bands, genes,
amplimers (sequence tagged sites), or any other mapped
objects. Given a region of interest, the comprehensive
maps are searched to find all loci that fall within it.
These loci can be displayed in a table, graphically as a
slice through a comprehensive map, or as slices through a
chosen set of primary maps. A comprehensive map slice
shows all loci in the region, including genes, expressed
sequence tags (ESTs), amplimers, and clones. A region
also can be specified as a neighborhood around a single
marker of interest.
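
A region-of-interest query of this kind can be sketched as a simple filter over an ordered map. Everything in the example is hypothetical: the `Locus` record, the locus names, and the numeric coordinates stand in for whatever representation GDB actually uses for its comprehensive maps.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    kind: str        # "gene", "amplimer", "EST", "clone", ...
    position: float  # hypothetical coordinate on the comprehensive map


# Hypothetical comprehensive map for one chromosome.
COMPREHENSIVE_MAP = [
    Locus("MARKER_A", "amplimer", 10.0),
    Locus("GENE_1", "gene", 12.5),
    Locus("EST_7", "EST", 14.0),
    Locus("MARKER_B", "amplimer", 20.0),
    Locus("GENE_2", "gene", 25.0),
]


def loci_between(cmap, left, right):
    """All loci lying between two flanking markers, in either order."""
    pos = {locus.name: locus.position for locus in cmap}
    lo, hi = sorted((pos[left], pos[right]))
    return [locus for locus in cmap if lo <= locus.position <= hi]


def neighborhood(cmap, marker, radius):
    """All loci within `radius` map units of a single marker."""
    center = next(locus.position for locus in cmap if locus.name == marker)
    return [locus for locus in cmap if abs(locus.position - center) <= radius]


print([l.name for l in loci_between(COMPREHENSIVE_MAP, "MARKER_B", "MARKER_A")])
print([l.name for l in neighborhood(COMPREHENSIVE_MAP, "GENE_1", 2.0)])
```

Because the search runs against one integrated map per chromosome, a single pass finds loci that originally came from many different primary maps.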

Results of queries for genes, amplimers, ESTs, or clones
can be displayed on a GDB comprehensive map. Results
are spread across several chromosomes displayed in
Mapview. A query for all the PAX genes (specified as symbol
= PAX* on the gene query form) retrieves genes on
multiple chromosomes. Double-clicking on one of these
genes brings up detailed gene information via the Web
browser.
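
The behavior of a wildcard query such as symbol = PAX* can be approximated with shell-style pattern matching. The records below are illustrative only (the GENE_X entry and its synonym are invented to show synonym matching), and the `query_symbol` function is a stand-in, not GDB's query engine.

```python
from fnmatch import fnmatchcase

# Illustrative records: (primary symbol, synonyms, chromosome).
RECORDS = [
    ("PAX3", [], "2"),
    ("PAX6", [], "11"),
    ("GENE_X", ["PAX_OLD"], "7"),  # hypothetical gene with a matching synonym
    ("MYOD1", [], "11"),
]


def query_symbol(records, pattern):
    """Return (symbol, chromosome) for records whose name or synonym matches."""
    hits = []
    for symbol, synonyms, chromosome in records:
        if any(fnmatchcase(name, pattern) for name in [symbol, *synonyms]):
            hits.append((symbol, chromosome))
    return hits


hits = query_symbol(RECORDS, "PAX*")
print(hits)
print(sorted({chromosome for _symbol, chromosome in hits}, key=int))
```

The hits fall on several chromosomes, which is why such a result is displayed as a set of maps instead of a single one.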

Querying  by  Polymorphism 

GDB contains a large number of polymorphisms associated
with genes and other markers. Queries can be constructed
for a particular type of marker (e.g., gene,
amplimer, clone), polymorphism (e.g., dinucleotide repeat),
or level of heterozygosity. These queries can be combined
with positional queries to find, for example, polymorphic
amplimers in a region bounded by flanking markers or in a
particular chromosomal band. If desired, the retrieved
markers can be viewed on a comprehensive map.

Work in Progress


Mapview 2.3

Mapview 2.1, the next generation of the GDB map viewer,
was released in March 1997. The latest version,
Mapview 2.3, is available in all common computing environments
because it is written in the Java programming language.
Most important, the new viewer can display multiple
aligned maps side by side in the window, with alignment
lines indicating common markers in neighboring
maps. As before, users can select individual markers to retrieve
more information about them from the database.

GDB developers have entered into a collaborative relationship
with other members of the bioWidget Consortium so
the Java-based alignment viewer will become part of a collection
of freely available software tools for displaying
biological data
(http://goodman.jax.org/projects/biowidgets/consortium).

Future plans for Mapview include providing or enhancing
the ability to generate manuscript-ready Postscript map images,
highlight or modify the display of particular classes
of map objects based on attribute values, and requery for
additional information.

Variation 

Since its inception, GDB has been a repository for polymorphism
data, with more than 18,000 polymorphisms
now in GDB. A collaboration has been initiated with the
Human Gene Mutation Database (HGMD) based in
Cardiff, Wales, and headed by David Cooper and Michael
Krawczak. HGMD's extensive collection of human mutation
data, covering many disease-causing loci, includes sequence-level
mutation characterizations. This data set will
be included in GDB and updated from HGMD on an ongoing
basis. The HGMD team also will provide advice on
GDB's representation of genetic variation, which is being
enhanced to model mutations and polymorphisms at the
sequence level. These modifications will allow GDB to act
as a repository for single-nucleotide polymorphisms, which
are expected to be a major source of information on human
genetic variation in the near future.

Mouse  Synteny 

Genomic relationships between mouse and man provide
important clues regarding gene location, phenotype, and
function. One of GDB's goals is to enable direct comparisons
between these two organisms, in collaboration with
the Mouse Genome Database at Jackson Laboratory. GDB
is making additions to its schema to represent this information
so that it can be displayed graphically with
Mapview. In addition, algorithmic work is under way to
use mapping data to automatically identify regions of conserved
synteny between mouse and man. These algorithms
will allow the synteny maps to be updated regularly. An
important application of comparative mapping is the ability
to predict the existence and location of unknown human
homologs of known, mapped mouse genes. A set of such
predictions is available in a report at the GDB Web site,
and similar data will be available in the database itself in
the spring of 1998.
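
One simple way to find candidate regions of conserved synteny from mapping data, and to predict where an unmapped human homolog might lie, is sketched below. This is an assumed approach for illustration, not the algorithm GDB is developing: it treats a run of consecutive mouse genes whose known human homologs share a chromosome as a block, and assigns that chromosome to any gene inside the block that lacks a known homolog. Gene names and chromosome assignments are invented.

```python
from itertools import groupby

# Hypothetical input: mouse genes in map order on one mouse chromosome,
# each paired with the human chromosome of its known homolog (None if unknown).
mouse_order = [
    ("m1", "5"), ("m2", "5"), ("m3", None), ("m4", "5"),
    ("m5", "11"), ("m6", "11"), ("m7", "2"),
]


def synteny_blocks(genes, min_size=2):
    """Runs of consecutive mapped genes whose homologs share a human chromosome."""
    known = [(gene, chrom) for gene, chrom in genes if chrom is not None]
    blocks = []
    for human_chr, run in groupby(known, key=lambda item: item[1]):
        names = [gene for gene, _ in run]
        if len(names) >= min_size:
            blocks.append((human_chr, names))
    return blocks


def predict_homolog_chromosomes(genes, blocks):
    """Predict a human chromosome for unmapped genes lying inside a block."""
    index = {gene: i for i, (gene, _) in enumerate(genes)}
    predictions = {}
    for human_chr, names in blocks:
        lo, hi = index[names[0]], index[names[-1]]
        for gene, chrom in genes[lo:hi + 1]:
            if chrom is None:
                predictions[gene] = human_chr
    return predictions


blocks = synteny_blocks(mouse_order)
print(blocks)
print(predict_homolog_chromosomes(mouse_order, blocks))
```

Requiring at least two supporting genes per block is an arbitrary threshold; a production method would also use map distances and tolerate local rearrangements.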

Collaborations 

GDB is a participant in the Genome Annotation Consortium
(GAC) project, whose goal is to produce high-quality, automatic
annotation of genomic sequences
(http://compbio.ornl.gov/CoLab). Currently, GDB is developing a prototype
mechanism to transition from GDB's Mapview display
to the GAC sequence-level browser over common genome
regions. GAC also will establish a human genome reference
sequence that will be the base against which GDB will refer
all polymorphisms and mutations. Ultimately, every genomic
object in GDB should be related to an appropriate
region of the reference sequence.

Sequencing  Progress 

The sequencing status of genomic regions now can be recorded
in GDB. Based on submissions to sequence databases,
GAC will determine genomic regions that have been
completed. GDB also will be collaborating with the European
Bioinformatics Institute, in conjunction with the international
Human Genome Organisation (HUGO), to maintain
a single shared Human Sequence Index that will record
commitments and status for sequencing clones or regions.
As a result, the sequencing status of any region can be displayed
alongside other GDB mapping data.

Outreach 

The  Genome  Database  continues  to  seek  direct  community 
feedback  and  interact  with  the  broader  science  community 
via  various  sources: 

• International Scientific Advisory Committee meets annually
to offer input and advice.

• Quarterly Review Committee confers frequently with
the staff to track GDB progress and suggest change.

• HUGO nomenclature, chromosome, and other editorial
committees have specialized functions within GDB,
providing official names and consensus maps and ensuring
the high quality of GDB's content.

Copies of GDB are available worldwide from ten mirror
sites (nodes), and GDB staff members meet annually with
node managers.

Research Narratives
National  Center  for  Genome  Resources 


Genome  Sequence  DataBase 
1800  Old  Pecos  Trail,  Suite  A 
Santa  Fe,  NM  87505 

Peter  Schad,  Vice-President 
Bioinformatics  and  Biotechnology 
505/995-4447,  Fax:  -4432 
cnc@ncgr.org 


Carol  Harger 
GSDB  Manager 
505/982-7840,  Fax:  -7690 
cah@ncgr.org 

http://www.ncgr.org 


The National Center for Genome Resources (NCGR) is a
not-for-profit organization created to design, develop, support,
and deliver resources in support of public and private
genome and genetic research. To accomplish these goals,
NCGR is developing and publishing the Genome Sequence
DataBase (GSDB) and the Genetics and Public
Issues (GPI) program.

NCGR is a center to facilitate the flow of information and
resources from genome projects into both public and private
sectors. A broadly based board of governors provides
direction and strategy for the center's development.

NCGR opened in Santa Fe in July 1994, with its initial
bioinformatics work being developed through a cooperative
5-year agreement with the Department of Energy
funded in July 1995. Committed to serving as a resource
for all genomic research, the center works collaboratively
with researchers and seeks input from users to ensure that
tools and projects under development meet their needs.

Genome  Sequence  DataBase 

GSDB is a relational database that contains nucleotide sequence
data and its associated annotation from all known
organisms (http://www.ncgr.org/gsdb). All data are freely
available to the public. The major goals of GSDB are to
provide the support structure for storing sequence data and
to furnish useful data-retrieval services.

GSDB adheres to the philosophy that the database is a
"community-owned" resource that should be simple to update
to reflect new discoveries about sequences. A corollary
to this is GSDB's conviction that researchers know
their areas of expertise much better than a database curator
and, therefore, they should be given ownership and control
over the data they submit to the database. The true role of
the GSDB staff is to help researchers submit data to and
retrieve data from the database.

GSDB  Enhancements 

During 1996, GSDB underwent a major renovation to support
new data types and concepts that are important to genomic
research. Tables within the database were restructured,
and new tables and data fields were added. Some
key additions to GSDB include the support of data ownership,
sequence alignments, and discontiguous sequences.

The concept of data ownership is a cornerstone to the
functioning of the new GSDB. Every piece of data (e.g.,
sequence or feature) within the database is owned by the
submitting researcher, and changes can be made only by
the data owner or GSDB staff. This implementation of data
ownership provides GSDB with the ability to support community
(third-party) annotation, that is, the addition of annotation
to a sequence by other community researchers.
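
The ownership rule can be expressed as a small access check. The sketch below is an assumption-laden illustration: the `Entry` class, the user names, and the `gsdb_staff` account are invented, and GSDB's real implementation lives in its relational schema and is not shown in the text.

```python
from dataclasses import dataclass, field

STAFF = {"gsdb_staff"}  # hypothetical staff account


@dataclass
class Entry:
    owner: str
    sequence: str
    annotations: list = field(default_factory=list)  # (author, text) pairs

    def update_sequence(self, user: str, new_sequence: str) -> None:
        """Only the data owner or database staff may change the data."""
        if user != self.owner and user not in STAFF:
            raise PermissionError(f"{user} does not own this entry")
        self.sequence = new_sequence

    def annotate(self, user: str, text: str) -> None:
        """Any community member may add annotation, recorded under their name."""
        self.annotations.append((user, text))


entry = Entry(owner="alice", sequence="ACGT")
entry.annotate("bob", "possible promoter upstream")
entry.update_sequence("alice", "ACGG")
print(entry)
```

Keeping the annotator's name with each annotation means a third-party comment never overwrites the owner's data, and each contributor remains accountable for what was added.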

A second enhancement of GSDB is the ability to store and
represent sequence alignments. GSDB staff has been constructing
alignments to several key sequences including
the env and pol (reverse transcriptase) genes of the HIV
genome, the complete chromosome VIII of Saccharomyces
cerevisiae, and the complete genome of Haemophilus
influenzae. These alignments are useful for locating possible
sites of biological interest and for rapidly identifying
differences between sequences.

A  third  key  GSDB  enhancement  is  the  ability  to  represent 
known  relationships  of  order  and  distance  between  sepa- 
rate individual  pieces  of  sequence.  These  sets  of  sequences 
and  their  relative  positions  are  grouped  together  as  a  single 
discontiguous  sequence.  Such  a  sequence  may  be  as 
simple  as  two  primers  that  define  the  ends  of  a  sequence 
tagged  site  (STS),  it  may  comprise  all  exons  that  are  part 
of  a  single  gene,  or  it  may  be  as  complex  as  the  STS  map 
for  an  entire  chromosome. 
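
As an illustration only, a discontiguous sequence of the kind described above could be modeled as an ordered list of pieces, each recording the distance to the next piece where that distance is known. The sketch below is in Python; the class names, field names, primer sequences, and gap size are all hypothetical and are not taken from GSDB's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Piece:
    """One contiguous stretch of sequence (hypothetical structure)."""
    name: str
    bases: str
    # Distance in bases to the next piece; None when unknown or for the last piece.
    gap_after: Optional[int] = None


@dataclass
class DiscontiguousSequence:
    """Ordered pieces plus the known distances between them."""
    name: str
    pieces: List[Piece]

    def span(self) -> Optional[int]:
        """Total length covered, or None if any internal gap is unknown."""
        total = 0
        for index, piece in enumerate(self.pieces):
            total += len(piece.bases)
            is_last = index == len(self.pieces) - 1
            if not is_last:
                if piece.gap_after is None:
                    return None
                total += piece.gap_after
        return total


# Simplest case from the text: two primers marking the ends of an STS.
# The primer sequences and the 180-base gap are made up for illustration.
sts = DiscontiguousSequence(
    name="example_sts",
    pieces=[
        Piece("left_primer", "ACGTACGTAC", gap_after=180),
        Piece("right_primer", "TTGACCAGTA"),
    ],
)
print(sts.span())  # 10 + 180 + 10 = 200
```

The same structure would extend to the larger cases mentioned in the text (the exons of one gene, or an STS map for a chromosome) simply by listing more pieces; where a distance has not been measured, `gap_after` stays `None` and `span()` reports that the total is unknown.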

GSDB staff has constructed discontiguous sequences for
human chromosomes 1 through 22 and X that include
markers from Massachusetts Institute of Technology-
Whitehead Institute STS maps and from the Stanford Human
Genome Center. The set of 2000 STS markers for
chromosome X, which was mapped recently by Washington
University at St. Louis, also has been added to chromosome X.
About 50 genomic sequences have been added
to the chromosome 22 map by determining their overlap
with STS markers. Genomic sequences are being added to
all the chromosomes as their overlap with the STS markers
is determined. These discontiguous sequences can be
retrieved easily and viewed via their sequence names using
the GSDB Annotator. Sequence names follow the format
of HUMCHR#MP, where # equals 1 through 22 or X.

GSDB staff also has utilized discontiguous sequences to
construct maps for maize and rice. The maize discontiguous
sequences were constructed using markers from the
University of Missouri, Columbia. Markers for the rice
discontiguous sequence were obtained from the Rice Genome
Database at Cornell University and the Rice Genome
Research Project in Japan.

New  Tools 

As a result of the major GSDB renovation, new tools were
needed for submitting and accessing database data. Annotator
was developed as a graphical interface that can be
used to view, update, and submit sequence data
(http://www.ncgr.org/gsdb/beta.html). Maestro, a Web-based
interface, was developed to assist researchers in data retrieval
(http://www.ncgr.org/gsdb/maestrobeta.html). Although
both these tools currently are available to researchers,
GSDB is continuing development to add increased
capabilities.

Annotator displays a sequence and its associated biological
information as an image, with the scale of the image adjustable
by the user. Additional information about the sequence
or an associated biological feature can be obtained
in a pop-up window. Annotator also allows a user to retrieve
a sequence for review, edit existing data, or add annotation
to the record. Sequences can be created using Annotator,
and any sequences created or edited can be saved
either to a local file for later review and further editing or
directly to the database.

Correct  database  structures  are  important  for  storing  data 
and  providing  the  research  community  with  tools  for 
searching  and  retrieving  data.  GSDB  is  making  a  con- 
certed effort  to  expand  and  improve  these  services.  The 
first  generation  of  the  Maestro  query  tool  is  available  from 
the  GSDB  Web  pages.  Maestro  allows  researchers  to  per- 
form queries  on  18  different  fields,  some  of  which  are 
queryable  only  through  GSDB,  for  example,  D  segment 
numbers  from  the  Genome  Database  at  Johns  Hopkins 
University  in  Baltimore. 

Additionally,  Maestro  allows  queries  with  mixed  Boolean 
operators  for  a  more  refined  search.  For  example,  a  user 
may  wish  to  compare  relatively  long  mouse  and  human 
sequences  that  do  not  contain  identified  coding  regions.  To 
obtain  all  sequences  meeting  these  criteria,  the  scientific 
name  field  would  be  searched  first  for  "Mus  musculus" 
and  then  for  "Homo  sapiens"  using  the  Boolean  term 
"OR."  Then  the  sequence-length  filter  could  be  used  to 
refine  the  search  to  sequences  longer  than  10,000  base 
pairs. To exclude sequences containing identified coding-region
features, the "BUT NOT" term can be used with the
Feature query field set equal to "coding region."
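
The mixed Boolean query described above can be sketched as a filter over sequence records. This is a minimal Python illustration and not Maestro's actual interface: the `Record` type, its field names, and the sample accessions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Record:
    """A simplified sequence record (hypothetical fields)."""
    accession: str
    organism: str
    length: int
    features: Set[str] = field(default_factory=set)


def long_noncoding_mouse_or_human(records: List[Record],
                                  min_length: int = 10_000) -> List[Record]:
    """("Mus musculus" OR "Homo sapiens") AND length > min_length
    BUT NOT feature == "coding region"."""
    wanted = {"Mus musculus", "Homo sapiens"}
    return [
        r for r in records
        if r.organism in wanted
        and r.length > min_length
        and "coding region" not in r.features
    ]


# Made-up records, one for each way a sequence can pass or fail the query.
records = [
    Record("EX0001", "Mus musculus", 15_000),
    Record("EX0002", "Homo sapiens", 12_000, {"coding region"}),
    Record("EX0003", "Homo sapiens", 8_000),
    Record("EX0004", "Zea mays", 20_000),
    Record("EX0005", "Homo sapiens", 25_000, {"repeat region"}),
]
hits = long_noncoding_mouse_or_human(records)
print([r.accession for r in hits])  # ['EX0001', 'EX0005']
```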

With  Maestro,  users  can  view  the  list  of  search  matches  a 
few  at  a  time  and  retrieve  more  of  the  list  as  needed.  From 
the  list,  users  can  select  one  or  several  sequences  accord- 
ing to  their  short  descriptions  and  review  or  download  the 
sequence  information  in  GIO,  FASTA,  or  GSDB  flatfile 
format. 

Future  Plans 

Although  most  pieces  necessary  for  operation  are  now  in 
place,  GSDB  is  still  improving  functionality  and  adding 
enhancements.  During  the  next  year  GSDB,  in  collabora- 
tion with  other  researchers,  anticipates  creating  more 
discontiguous  sequence  maps  for  several  model  organisms, 
adding  more  functionality  to  and  providing  a  Web-based 
submission  tool  and  tool  kit  for  creating  GIO  files. 

Microbial  Genome  Web  Page 

NCGR  also  maintains  informational  Web  pages  on  micro- 
bial genomes.  These  pages,  created  as  a  community  refer- 
ence, contain  a  list  of  current  or  completed  eubacterial, 
Archaeal,  and  eukaryotic  genome  sequencing  projects. 
Each  main  page  includes  the  name  of  the  organism  being 
sequenced,  sequencing  groups  involved,  background  infor- 
mation on  the  organism,  and  its  current  location  on  the 
Carl  Woese  Tree  of  Life.  As  the  Microbial  Genome  Project 
progresses,  the  pages  will  be  updated  as  appropriate. 

Genetics  and  Public  Issues  Program 

GPI serves as a crucial resource for people seeking information
and making decisions about genetics or genomics
(http://www.ncgr.org/gpi). GPI develops and provides information
that explains the ethical, legal, policy, and social
relevance of genetic discoveries and applications.

To  achieve  its  mission,  GPI  has  set  forth  three  goals: 
(1)  preparation  and  development  of  resources,  including 
careful  delineation  of  ethical,  legal,  policy,  and  social  is- 
sues in  genetics  and  genomics;  (2)  dissemination  of  ge- 
netic information  targeted  to  the  public,  legal  and  health 
professionals,  policymakers,  and  decision  makers;  and  (3) 
creation  of  an  information  network  to  facilitate  interaction 
among  groups. 

GPI  delivers  information  through  four  primary  vehicles: 
online  resources,  conferences,  publications,  and  educa- 
tional programs.  The  GPI  program  maintains  a  continually 
evolving  World  Wide  Web  site  containing  a  range  of  mate- 
rial freely  accessible  over  the  Internet. 

Discovering Genes
for New Medicines

By identifying human genes involved in disease,
researchers can create potentially therapeutic proteins
and speed the development of powerful drugs

by William A. Haseltine


Most readers of this magazine
are probably familiar
with the idea of a gene as
something that transmits inherited traits
from one generation to the next. Less
well appreciated is that malfunctioning
genes are deeply involved in most diseases,
not only inherited ones. Cancer,
atherosclerosis, osteoporosis, arthritis
and Alzheimer's disease, for example,
are all characterized by specific changes
in the activities of genes. Even infectious
disease usually provokes the activation
of identifiable genes in a patient's
immune system. Moreover, accumulated
damage to genes from a lifetime of
exposure to ionizing radiation and injurious
chemicals probably underlies some
of the changes associated with aging.

A few years ago I and some like-minded
colleagues decided that knowing
where and when different genes are
switched on in the human body would
lead to far-reaching advances in our ability
to predict, prevent, treat and cure disease.
When a gene is active, or as a geneticist
would say, "expressed," the sequence
of the chemical units, or bases,
in its DNA is used as a blueprint to produce
a specific protein. Proteins direct,
in various ways, all of a cell's functions.
They serve as structural components, as
catalysts that carry out the multiple
chemical processes of life and as control
elements that regulate cell reproduction,
cell specialization and physiological activity
at all levels. The development of a
human from fertilized egg to mature
adult is, in fact, the consequence of an
orderly change in the pattern of gene
expression in different tissues.

Knowing  which  genes  are  expressed 
in  healthy  and  diseased  tissues,  we  real- 
ized, would  allow  us  to  identify  both 
the  proteins  required  for  normal  func- 
tioning of  tissues  and  the  aberrations 
involved  in  disease.  With  that  informa- 
tion in  hand,  it  would  be  possible  to  de- 
velop new  diagnostic  tests  for  various 
illnesses  and  new  drugs  to  alter  the  ac- 
tivity of  affected  proteins  or  genes.  Inves- 
tigators might  also  be  able  to  use  some 
of  the  proteins  and  genes  we  identified 
as  therapeutic  agents  in  their  own  right. 
We  envisaged,  in  a  sense,  a  high-resolu- 
tion description  of  human  anatomy  de- 
scending to  the  molecular  level  of  detail. 

It  was  clear  that  identifying  all  the  ex- 
pressed genes  in  each  of  the  dozens  of 
tissues  in  the  body  would  be  a  huge  task. 
There  are  some  100,000  genes  in  a  typi- 
cal human  cell.  Only  a  small  proportion 
of  those  genes  (typically  about  15,000) 
is  expressed  in  any  one  type  of  cell,  but 
the  expressed  genes  vary  from  one  cell 
type  to  another.  So  looking  at  just  one 
or  two  cell  types  would  not  reveal  the 
genes  expressed  in  the  rest  of  the  body. 
We would also have to study tissues
from all the stages of human development.
Moreover, to identify the changes
in gene expression that contribute to
sickness, we would have to analyze diseased
as well as healthy tissues.

Technological advances have provided
a way to get the job done. Scientists
can now rapidly discover which genes
are expressed in any given tissue. Our
strategy has proved the quickest way to
identify genes of medical importance.

Take the example of atherosclerosis.
In this common condition, a fatty substance
called plaque accumulates inside
arteries, notably those supplying the
heart. Our strategy enables us to generate
a list of genes expressed in normal
arteries, along with a measure of the
level of expression of each one. We can
then compare the list with one derived
from patients with atherosclerosis. The
difference between the lists corresponds
to the genes (and thus the proteins) involved
in the disease. It also indicates
how much the genes' expression has
been increased or decreased by the illness.
Researchers can then make the human
proteins specified by those genes.
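
The comparison of the two lists can be sketched in a few lines of Python. This is only an illustration under stated assumptions: expression level is represented here as a count of partial cDNA sequences per gene, the gene names and counts are invented, and the fold-change threshold and the +1 pseudocount (used to avoid dividing by zero) are choices made for the example, not details given in the article.

```python
from typing import Dict


def expression_changes(normal: Dict[str, int],
                       diseased: Dict[str, int],
                       min_fold: float = 2.0) -> Dict[str, float]:
    """Return genes whose expression differs between the two tissues.

    The value is diseased/normal expression, with 1 added to each count
    so that genes absent from one list can still be compared.
    """
    changes = {}
    for gene in set(normal) | set(diseased):
        fold = (diseased.get(gene, 0) + 1) / (normal.get(gene, 0) + 1)
        if fold >= min_fold or fold <= 1 / min_fold:
            changes[gene] = fold
    return changes


# Invented counts for normal arteries and for arteries with plaque.
normal = {"GENE_A": 10, "GENE_B": 50, "GENE_C": 5}
diseased = {"GENE_A": 11, "GENE_B": 4, "GENE_D": 19}

changes = expression_changes(normal, diseased)
for gene in sorted(changes):
    direction = "increased" if changes[gene] > 1 else "decreased"
    print(gene, direction, round(changes[gene], 2))
```

In this made-up data GENE_A is essentially unchanged and drops out, while GENE_B and GENE_C are flagged as decreased and GENE_D as increased.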

Once a protein can be manufactured
in a pure form, scientists can fairly easily
fashion a test to detect it in a patient.
A test to reveal overproduction of a protein
found in plaque might expose early
signs of atherosclerosis, when better
options exist for treating it. In addition,
pharmacologists can use pure proteins
to help them find new drugs. A chemical
that inhibited production of a protein
found in plaque might be considered
as a drug to treat atherosclerosis.

Our  approach,  which  I  call  medical 
genomics,  is  somewhat  outside  the  main- 
stream of  research  in  human  genetics.  A 
great  many  scientists  are  involved  in  the 
Human  Genome  Project,  an  internation- 
al collaboration  devoted  to  the  discov- 
ery of  the  complete  sequence  of  the 
chemical  bases  in  human  DNA.  (All  the 
codes  in  DNA  are  constructed  from  an 
alphabet  consisting  of  just  four  bases.) 
That information will be important for
studies of gene action and evolution and
will particularly benefit research on inherited
diseases. Yet the genome project
is not the fastest way to discover genes,
because most of the bases that make up
DNA actually lie outside genes. Nor will
the project pinpoint which genes are involved
in illness.

In 1992 we created a company, Human
Genome Sciences (HGS), to pursue
our vision. Initially we conducted the
work as a collaboration between HGS
and the Institute for Genomic Research,
a not-for-profit organization that HGS
supports; the institute's director, J. Craig
Venter, pioneered some of the key ideas
in genomic research. Six months into
the collaboration, SmithKline Beecham,
one of the world's largest pharmaceutical
companies, joined HGS in the effort.
After the first year, HGS and SmithKline
Beecham continued on their own. We
were joined later by Schering-Plough,
Takeda Chemical Industries in Japan,
Merck KGaA in Germany and Synthelabo
in France.

[Sidebar: How to Find a Partial cDNA Sequence]

Genes  by  the  Direct  Route 

Because the key to developing new
medicines lies principally in the proteins
produced by human genes, rather
than the genes themselves, one might
wonder why we bother with the genes
at all. We could in principle analyze a
cell's proteins directly. Knowing a protein's
composition does not, however, allow
us to make it, and to develop medicines,
we must manufacture substantial
amounts of proteins that seem important.
The only practical way to do so is
to isolate the corresponding genes and
transplant them into cells that can express
those genes in large amounts.

Our method for finding genes focuses
on a critical intermediate product created
in cells whenever a gene is expressed.
This intermediate product is called messenger
RNA (mRNA); like DNA, it consists
of sequences of four bases. When a
cell makes mRNA from a gene, it essentially
copies the sequence of DNA bases
in the gene. The mRNA then serves as a
template for constructing the specific
protein encoded by the gene. The value
of mRNA for research is that cells make
it only when the corresponding gene is
active. Yet the mRNA's base sequence,
being simply related to the sequence of
the gene itself, provides us with enough
information to isolate the gene from the
total mass of DNA in cells and to make
its protein if we want to.

For our purposes, the problem with
mRNA is that it can be difficult to
handle. So we in fact work with a surrogate:
stable DNA copies, called complementary
DNAs (cDNAs), of the mRNA
molecules. We make the cDNAs by simply
reversing the process the cell uses to
make mRNA from DNA.

The cDNA copies we produce this
way are usually replicas of segments of
mRNA rather than of the whole molecule,
which can be many thousands of
bases long. Indeed, different parts of a
gene can give rise to cDNAs whose common
origin may not be immediately apparent.
Nevertheless, a cDNA containing
just a few thousand bases still preserves
its parent gene's unique signature.
That is because it is vanishingly unlikely
that two different genes would share
an identical sequence thousands of bases
long. Just as a random chapter taken
from a book uniquely identifies the
book, so a cDNA molecule uniquely
identifies the gene that gave rise to it.

Once  we  have  made  a  cDNA,  we  can 
copy  it  to  produce  as  much  as  we  want. 
That  means  we  will  have  enough  mate- 
rial for  determining  the  order  of  its  bas- 
es. Because  we  know  the  rules  that  cells 
use  to  turn  DNA  sequences  into  the  se- 
quences of  amino  acids  that  constitute 
proteins,  the  ordering  of  bases  tells  us 
the  amino  acid  sequence  of  the  corre- 
sponding protein  fragment.  That  se- 
quence, in  turn,  can  be  compared  with 
the  sequences  in  proteins  whose  struc- 
tures are  known.  This  maneuver  often 
tells  us  something  about  the  function  of 
the  complete  protein,  because  proteins 
containing  similar  sequences  of  amino 
acids  often  perform  similar  tasks. 
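
The "rules that cells use" mentioned above are the genetic code, in which each group of three bases (a codon) specifies one amino acid. The Python sketch below translates a DNA fragment with the standard genetic code and one-letter amino acid symbols. It is an illustration rather than the software used at HGS: the function name, the `frame` parameter, and the example sequences are assumptions, and a real partial cDNA would usually need to be tried in several reading frames.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA into one-letter amino acids, stopping at a stop codon.

    `frame` (0, 1 or 2) is the offset of the first codon. Only the letters
    A, C, G and T are handled; any other symbol raises KeyError.
    """
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))      # MAK
print(translate("CATGGCC", frame=1))  # MA
```

The resulting amino acid string is what would then be compared with the sequences of known proteins.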

Analyzing cDNA sequences used to
be extremely time-consuming, but in recent
years biomedical instruments have
been developed that can perform the
task reliably and automatically. Another
development was also necessary to
make  our  strategy  feasible.  Sequencing 
equipment,  when  operated  on  the  scale 
we  were  contemplating,  produces  gar- 
gantuan amounts  of  data.  Happily,  com- 
puter systems  capable  of  handling  the 
resulting  megabytes  are  now  available, 
and  we  and  others  have  written  software 
that  helps  us  make  sense  of  this  wealth 
of  genetic  detail. 

Assembling  the  Puzzle 

Our technique for identifying the
genes used by a cell is to analyze a
sequence of 300 to 500 bases at one
end of each cDNA molecule. These partial
cDNA sequences act as markers for
genes and are sometimes referred to as
expressed sequence tags. We have chosen
this length for our partial cDNA sequences
because it is short enough to
analyze fairly quickly but still long
enough to identify a gene unambiguously.
If a cDNA molecule is like a chapter
from a book, a partial sequence is like
the first page of the chapter: it can identify
the book and even give us an idea
what the book is about. Partial cDNA
sequences, likewise, can tell us something
about the gene they derive from.
At HGS, we produce about a million
bases of raw sequence data every day.
Our method is proving successful: in
less than five years we have identified
thousands of genes, many of which may
play a part in illness. Other companies
and academic researchers have also initiated
programs to generate partial
cDNA sequences.

PERCENTAGE OF GENES devoted to
each of the major activities in the typical
human cell has been deduced from a study
of 150,000 partial sequences. Similarities
with human or other genes of known function
were used to assign provisional categories
of activity.

HGS's computers recognize many of
the partial sequences we produce as deriving
either from one of the 6,000
genes researchers have already identified
by other means or from a gene we have
previously found ourselves. When we
cannot definitely assign a newly generated
partial sequence to a known gene,
things get more interesting. Our computers
then scan through our databases
as well as public databases to see whether
the new partial sequence overlaps
something someone has logged before.
When we find a clear overlap, we piece
together the overlapping partial sequences
into ever lengthening segments
called contigs. Contigs correspond, then,
to incomplete sequences we infer to be
present somewhere in a parent gene.
This process is somewhat analogous to
fishing out the phrases "a midnight
dreary, while I pondered" and "while I
pondered, weak and weary/Over many
a . . . volume" and combining them into
a fragment recognizable as part of Edgar
Allan Poe's "The Raven."
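
The piecing-together step can be illustrated with a short Python sketch that joins two fragments when the end of one matches the start of the other. This is a simplified stand-in, not the software used at HGS: it requires an exact match and handles only one pair at a time, whereas real assembly programs must tolerate sequencing errors and combine many fragments. The function names and the `min_overlap` parameter are assumptions made for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 10) -> int:
    """Length of the longest suffix of `left` that is also a prefix of
    `right`, or 0 if it is shorter than `min_overlap`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_overlap: int = 10) -> Optional[str]:
    """Join two overlapping fragments into one longer fragment (a contig),
    or return None when no sufficient overlap exists."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# The analogy from the text, using two overlapping phrases.
first = "a midnight dreary, while I pondered"
second = "while I pondered, weak and weary"
print(merge(first, second))
# a midnight dreary, while I pondered, weak and weary
```

Requiring a minimum overlap is the design choice that matters here: very short matches between unrelated fragments occur by chance, so only a "clear overlap," in the article's words, should be accepted.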

At the same time, we attempt to deduce
the likely function of the protein
corresponding to the partial sequence.
Once we have predicted the protein's
structure, we classify it according to its
similarity to the structures of known
proteins. Sometimes we find a match
with another human protein, but often
we notice a match with one from a bacterium,
fungus, plant or insect: other
organisms produce many proteins similar
in function to those of humans. Our
computers continually update these
provisional classifications.

Three years ago, for example, we predicted
that genes containing four specific
contigs would each produce proteins
similar to those known to correct
mutations in the DNA of bacteria and
yeast. Because researchers had learned
that failure to repair mutations can cause
colon cancer, we started to work out the
full sequences of the four genes. When
a prominent colon cancer researcher
later approached us for help in identifying
genes that might cause that illness
(he already knew about one such gene),
we were able to tell him that we were
already working with three additional
genes that might be involved.

Subsequent research has confirmed
that mutations in any one of the four
genes can cause life-threatening colon,
ovarian or endometrial cancer. As many
as one in every 200 people in North
America and Europe carry a mutation
in one of these mismatch repair genes,
as they are called. Knowing this, scientists
can develop tests to assess the mismatch
repair genes in people who have
relatives with these cancers. If the people
who are tested display a genetic predisposition
to illness, they can be monitored
closely. Prompt detection of tumors
can lead to lifesaving surgery, and
such tests have already been used in clinical
research to identify people at risk.

Our database now contains more than
a million cDNA-derived partial gene sequences,
sorted into 170,000 contigs.
We think we have partial sequences
from almost all expressed human genes.
One indication is that when other scientists
log gene sequences into public
databases, we find that we already have
a partial sequence for more than 95 percent
of them. Piecing together partial sequences
frequently uncovers entire new
genes. Overall more than half of the new
genes we identify have a resemblance to
known genes that have been assigned a
probable function. As time goes by, this
proportion is likely to increase.

If a tissue gives rise to an unusually
large number of cDNA sequences that
derive from the same gene, it provides
an indication that the gene in question is
producing copious amounts of mRNA.
That generally happens when the cells
are producing large amounts of the corresponding
protein, suggesting that the
protein may be doing a particularly vital
job. HGS also pays particular attention
to genes that are expressed only in
a narrow range of tissues, because such
genes are most likely to be useful for intervening
in diseases affecting those tissues.
Of the thousands of genes we have
discovered, we have identified about 300
that seem especially likely to be medically
important.

New  Genes,  New  Medicines 

Using  the  partial  cDNA  sequence 
technique  for  gene  discovery,  re- 
searchers have  for  the  first  time  been 
able  to  assess  how  many  genes  are  de- 
voted to  each  of  the  main  cellular  func- 
tions, such  as  defense,  metabolism  and 
so  on.  The  vast  store  of  unique  infor- 
mation from  partial  cDNA  sequences 
offers  new  possibilities  for  medical  sci- 
ence. These  opportunities  are  now  being 
systematically  explored. 

Databases such as ours have already
proved their value for finding proteins
that are useful as signposts of disease.
Prostate cancer is one example. A widely
used test for detecting prostate cancer
measures levels in the blood of a protein
called prostate specific antigen. Patients
who have prostate cancer often exhibit
unusually high levels. Unfortunately,
slow-growing, relatively benign tumors
as well as malignant tumors requiring
aggressive therapy can cause elevated
levels of the antigen, and so the test is
ambiguous.

HGS and its partners have analyzed
mRNAs from multiple samples of
healthy prostate tissue as well as from
benign and malignant prostate tumors.
We found about 300 genes that are expressed
in the prostate but in no other
tissue; of these, about 100 are active
only in prostate tumors, and about 20
are expressed only in tumors rated by
pathologists as malignant. We and our
commercial partners are using these 20
genes and their protein products to devise
tests to identify malignant prostate
disease. We have similar work under way
for breast, lung, liver and brain cancers.

Databases of partial cDNA sequences
can also help find genes responsible
for rare diseases. Researchers have long
known, for example, that a certain form
of blindness in children is the result of an
inherited defect in the chemical breakdown
of the sugar galactose. A search
of our database revealed two previously
unknown human genes whose corresponding
proteins were predicted to be
structurally similar to known
galactose-metabolizing enzymes in yeast and bacteria.
Investigators quickly confirmed
that inherited defects in either of these
two genes cause this type of blindness.
In the future, the enzymes or the genes
themselves might be used to prevent the
affliction.

ROBOT used to distinguish bacterial colonies
that have picked up human DNA
sequences is at the top. The instrument's
arms ignore colonies that are blue, the
sign that they contain no human DNA.
By analyzing the sequences in the bacteria,
researchers can identify human genes.

Partial cDNA sequences are also establishing
an impressive record for helping
researchers to find smaller molecules
that are candidates to be new treatments.
Methods for creating and testing
small-molecule drugs (the most common
type) have improved dramatically
in the past few years. Automated equipment
can rapidly screen natural and synthetic
compounds for their ability to affect
a human protein involved in disease,
but the limited number of known protein
targets has delayed progress. As
more human proteins are investigated,
progress should accelerate. Our work is
now providing more than half of SmithKline
Beecham's leads for potential
products.

Databases such as ours not only make
it easier to screen molecules randomly
for useful activity. Knowing a protein's
structure enables scientists to custom-design
drugs to interact in a specific way
with the protein. This technique, known
as rational drug design, was used to create
some of the new protease inhibitors
that are proving effective against HIV
(although our database was not involved
in this particular effort). We are confident
that partial cDNA sequences will
allow pharmacologists to make more
use of rational drug design.

One example of how our database
has already proved useful concerns cells
known as osteoclasts, which are normally
present in bone; these cells produce
an enzyme capable of degrading bone
tissue. The enzyme appears to be produced
in excess in some disease states,
such as osteoarthritis and osteoporosis.
We found in our computers a sequence
for a gene expressed in osteoclasts that
appeared to code for the destructive enzyme;
its sequence was similar to that of
a gene known to give rise to an enzyme
that degrades cartilage. We confirmed
that the osteoclast gene was responsible
for the degradative enzyme and also
showed that it is not expressed in other
tissues. Those discoveries meant we
could invent ways to thwart the gene's
protein without worrying that the methods
would harm other tissues. We then
made the protein, and SmithKline Beecham
has used it to identify possible therapies
by a combination of high-throughput
screening and rational drug design.
The company has also used our database
to screen for molecules that might
be used to treat atherosclerosis.

Scientific American March 1997
Discovering Genes for New Medicines

One extremely rich lode of genes and
proteins, from a medical point of view,
is a class known as G-protein coupled
receptors. These proteins span the cell's
outer membrane and convey biological
signals from other cells into the cell's interior.
It is likely that drugs able to inhibit
such vital receptors could be used
to treat diseases as diverse as hypertension,
ulcers, migraine, asthma, the common
cold and psychiatric disorders.
HGS has found more than 70 new G-protein
coupled receptors. We are now
testing their effects by introducing receptor
genes we have discovered into
cells and evaluating how the cells that
make the encoded proteins respond to
various stimuli. Two genes that are of
special interest produce proteins that
seem to be critically involved in hypertension
and in adult-onset diabetes. Our
partners in the pharmaceutical industry
are searching for small molecules that
should inhibit the biological signals
transmitted by these receptors.

Last  but  not  least,  our  research  sup- 
ports our  belief  that  some  of  the  human 
genes and proteins we are now discovering
will, perhaps in modified form,
themselves  constitute  new  therapies. 
Many  human  proteins  are  already  used 
HUMAN PROTEINS made after their genes were discovered at Human Genome Sciences
include several that demonstrate powerful effects in isolated cells and in experimental
animals. These examples are among a number of human proteins now being
tested to discover their possible medical value.


as drugs; insulin and clotting factor for
hemophiliacs  are  well-known  examples. 
Proteins  that  stimulate  the  production  of 
blood  cells  are  also  used  to  speed  pa- 
tients' recovery  from  chemotherapy. 

The  proteins  of  some  200  of  the  full- 
length  gene  sequences  HGS  has  uncov- 
ered have  possible  applications  as  medi- 
cines. We  have  made  most  of  these  pro- 
teins and have instituted tests of their
activity  on  cells.  Some  of  them  are  also 
proving  promising  in  tests  using  experi- 
mental animals.  The  proteins  include 
several  chemokines,  molecules  that  stim- 
ulate immune  system  cells. 

Developing  pharmaceuticals  will  nev- 
er be  a  quick  process,  because  medicines, 
whether  proteins,  genes  or  small  mole- 
cules, have  to  be  extensively  tested.  Nev- 
ertheless, partial  cDNA  sequences  can 
speed  the  discovery  of  candidate  thera- 


pies. HGS  allows  academic  researchers 
access  to  much  of  its  database,  although 
we  ask  for  an  agreement  to  share  royal- 
ties from  any  ensuing  products. 

The  systematic  use  of  automated  and 
computerized  methods  of  gene  discov- 
ery has  yielded,  for  the  first  time,  a  com- 
prehensive picture  of  where  different 
genes  are  expressed— the  anatomy  of 
human  gene  expression.  In  addition,  we 
are  starting  to  learn  about  the  changes 
in  gene  expression  in  disease.  It  is  too 
early  to  know  exactly  when  physicians 
will  first  successfully  use  this  knowl- 
edge to  treat  disease.  Our  analyses  pre- 
dict, however,  that  a  number  of  the  re- 
sulting therapies  will  form  mainstays  of 
21st-century medicine.

AT THE END OF THE ROAD in Little
Cottonwood  Canyon,  near  Salt 
Lake  City,  Alta  is  a  place  of 
near-mythic  renown  among 
skiers.  In  time  it  may  well 
assume  similar  status  among  molecular 
geneticists.  In  December  1984,  a  conference 
there,  co-sponsored  by  the  U.S.  Department 
of  Energy,  pondered  a  single  question:  Does 
modern  DNA  research  offer  a  way  of  detect- 
ing tiny  genetic  mutations  — and,  in  particu- 
lar, of  observing  any  increase  in  the  mutation 
rate  among  the  survivors  of  the  Hiroshima 
and Nagasaki bombings and their descendants?
In short, the answer was: Not yet.
But  in  an  atmosphere  of  rare  intellectual  fer- 
tility, the  seeds  were  sown  for  a  project  that 
would  make  such  detection  possible  in  the 
future— the  Human  Genome  Project. 

In  the  months  that  followed,  much 
deliberation  and  debate  ensued.  But  in  1986, 
the  DOE  took  a  bold  and  unilateral  step  by 
announcing  its  Human  Genome  Initiative, 
convinced  that  its  mission  would  be  well 
served  by  a  comprehensive  picture  of  the 
human  genome.  The  immediate  response 
was  considerable  skepticism— skepticism 
about  the  scientific  community's  technologi- 
cal wherewithal  for  sequencing  the  genome 
at  a  reasonable  cost  and  about  the  value  of 
the  result,  even  if  it  could  be  obtained  eco- 
nomically. 

Things  have  changed.  Today,  a  decade 
later,  a  worldwide  effort  is  under  way  to 
develop  and  apply  the  technologies  needed  to 
completely  map  and  sequence  the  human 
genome,  as  well  as  the  genomes  of  several 
model   organisms.      Technological   progress 


has  been  rapid,  and  it  is  now  generally  agreed 
that  this  international  project  will  produce 
the  complete  sequence  of  the  human  genome 
by  the  year  2005. 

And  what  is  more  important,  the  value 
of  the  project  also  appears  beyond  doubt. 
Genome  research  is  revolutionizing  biology 
and  biotechnology,  and  providing  a  vital 
thrust  to  the  increasingly  broad  scope  of  the 
biological  sciences.  The  impact  that  will  be 
felt  in  medicine  and  health  care  alone,  once 
we  identify  all  human  genes,  is  inestimable. 
The project has already stimulated signifi-
cant investment  by  large  corporations  and 
prompted  the  creation  of  new  companies  hop- 
ing to  capitalize  on  its  profound  implications. 

But the DOE's early, catalytic decision
deserves further comment. The organizers of
the DOE's genome initiative recognized that
the  information  the  project  would  generate  — 
both  technological  and  genetic  — would  con- 
tribute not  only  to  a  new  understanding  of 
human  biology,  but  also  to  a  host  of  practical 
applications  in  the  biotechnology  industry 
and  in  the  arenas  of  agriculture  and  environ- 
mental protection.  A  1987  report  by  a  DOE 
advisory  committee  provided  some  examples. 
The  committee  foresaw  that  the  project  could 
ultimately  lead  to  the  efficient  production  of 
biomass  for  fuel,  to  improvements  in  the 
resistance of plants to environmental stress,
and  to  the  practical  use  of  genetically  engi- 
neered microbes  to  neutralize  toxic  wastes. 
The  Department  thus  saw  far  more  to  the 
genome  project  than  a  promised  tool  for 
assessing  mutation  rates.  For  example, 
understanding  the  human  genome  will  have 
an  enormous  impact  on  our  ability  to  assess, 
individual by individual, the risk posed by
environmental  exposures  to  toxic  agents.  We 
know  that  genetic  differences  make  some  of 
us  more  susceptible,  and  others  more  resis- 
tant, to  such  agents.  Far  more  work  must  be 
done  before  we  understand  the  genetic  basis 
of  such  variability,  but  this  knowledge  will 
directly  address  the  DOE's  long-term  mis- 
sion to  understand  the  effects  of  low-level 
exposures  to  radiation  and  other  energy- 
related  agents  — especially  the  effects  of 
such  exposure  on  cancer  risk.  And  the 
genome  project  is  a  long  stride  toward  such 
knowledge. 

The  Human  Genome  Project  has  other 
implications  for  the  DOE  as  well.  In  1994, 
taking  advantage  of  new  capabilities  devel- 
oped by  the  genome  project,  the  DOE  for- 
mulated the  Microbial  Genome  Initiative  to 
sequence  the  genomes  of  bacteria  of  likely 
interest  in  the  areas  of  energy  production  and 
use,  environmental  remediation  and  waste 
reduction,  and  industrial  processing.  As  a 
result  of  this  initiative,  we  already  have  com- 
plete sequences  for  two  microbes  that  live 
under  extreme  conditions  of  temperature  and 
pressure.  Structural  studies  are  under  way  to 
learn  what  is  unique  about  the  proteins  of 
these  organisms— the  aim  being  ultimately  to 
engineer  these  microbes  and  their  enzymes 
for  such  practical  purposes  as  waste  control 
and  environmental  cleanup.  (DOE-funded 
genetic  engineering  of  a  thermostable  DNA 
polymerase  has  already  produced  an  enzyme 
that  has  captured  a  large  share  of  the  several- 
hundred-million-dollar  DNA  polymerase 
market.) 

And  other  little-studied  microbes  hint 
at  even  more  intriguing  possibilities.  For 
instance, Deinococcus radiodurans is a species
that  prospers  even  when  exposed  to  huge 
doses  of  ionizing  radiation.  This  microbe  has 
an  amazing  ability  to  repair  radiation- 
induced  damage  to  its  DNA.  Its  genome  is 
currently  being  sequenced  with  DOE  sup- 
port, with  the  hope  of  understanding  and 
ultimately  taking  practical  advantage  of  its 
unusual  capabilities.  For  example,  it  might 
be  possible  to  insert  foreign  DNA  into  this 
microbe  that  allows  it  to  digest  toxic  organic 


components  found  in  highly  radioactive 
waste,  thus  simplifying  the  task  of  further 
cleanup.  Another  approach  might  be  to 
introduce  metal-binding  proteins  onto  the 
microbe's  surface  that  would  scavenge  highly 
radioactive  isotopes  out  of  solution. 

Biotechnology,  fueled  in  part  by 
insights  reaped  from  the  genome  project,  will 
also  play  a  significant  role  in  improving 
the  use  of  fossil-based  resources.  Increased 
energy  demands,  projected  over  the  next  50 
years,  require  strategies  to  circumvent  the 
many  problems  associated  with  today's 
dominant  energy  systems.  Biotechnology 
promises  to  help  address  these  needs  by 
upgrading  the  fuel  value  of  our  current  ener- 
gy resources  and  by  providing  new  means  for 
the  bioconversion  of  raw  materials  to  refined 
products  — not  to  mention  offering  the 
possibility  of  entirely  new  biomass-based 
energy  sources. 

We  have  thus  seen  only  the  dawn  of  a 
biological  revolution.  The  practical  and  eco- 
nomic applications  of  biology  are  destined  for 
dramatic  growth.  Health-related  biotechnol- 
ogy is  already  a  multibillion-dollar  success 
story —  and  is  still  far  from  reaching  its  poten- 
tial. Other  applications  of  biotechnology  are 
likely  to  beget  similar  successes  in  the  coming 
decades.  Among  these  applications  are  sev- 
eral of  great  importance  to  the  DOE.  We  can 
look  to  improvements  in  waste  control  and  an 
exciting  era  of  environmental  bioremedia- 
tion; we will see new approaches to improving
energy efficiency; and we can even hope
for  dramatic  strides  toward  meeting  the  fuel 
demands  of  the  future.  The  insights,  the 
technologies,  and  the  infrastructure  that  are 
already  emerging  from  the  genome  project, 
together  with  advances  in  fields  such  as  com- 
putational and  structural  biology,  are  among 
our  most  important  tools  in  addressing  these 
national  needs. 


Aristides  A.  N.  Patrinos 
Director,  Human  Genome  Project 
U.S.  Department  of  Energy 
The Genome Project:
Why the DOE?


A BOLD BUT LOGICAL STEP


THE  BIOSCIENCES  RESEARCH  com- 
munity is  now  embarked  on  a 
program  whose  boldness,  even 
audacity,  has  prompted  compar- 
isons with  such  visionary  efforts 
as  the  Apollo  space  program  and  the 
Manhattan  project.  That  life  scientists 
should  conceive  such  an  ambitious  project  is 
not  remarkable;  what  is  surprising  — at  least 
at  first  blush  — is  that  the  project  should  trace 
its  roots  to  the  Department  of  Energy. 

For  close  to  a  half-century,  the  DOE 
and  its  governmental  predecessors  have  been 
charged  with  pursuing  a  deeper  understand- 
ing of  the   potential   health 
risks   posed  by  energy  use 
and    by   energy-production 
technologies  —  with    special 
interest     focused     on     the 
effects     of     radiation      on 
humans.   Indeed,  it  is  fair  to 
say  that   most  of  what  we 
know  today  about  radiologi- 
cal   health    hazards    stems 
from   studies   supported   by 
these  government  agencies. 
Among  these  investigations 
are  long-standing  studies  of 
the  survivors  of  the  atomic 
bombings  of  Hiroshima  and 
Nagasaki, as well as any
number     of     experimental 
studies  using  animals,  cells 
in  culture,  and  nonliving  systems.   Much  has 
been   learned,   especially  about  the   conse- 
quences of  exposure  to  high  doses  of  radia- 
tion.    On  the  other  hand,   many  questions 
remain  unanswered;  in  particular,  we  have 
much to learn about how low doses
produce  their  insidious  effects.  When  present 
merely  in  low  but  significant  amounts,  toxic 
agents  such  as  radiation  or  mutagenic  chemi- 
cals work  their  mischief  in  the  most  subtle 
ways,  altering  only  slightly  the  genetic 
instructions  in  our  cells.  The  consequences 
can  be  heritable  mutations  too  slight  to  pro- 
duce discernible  effects  in  a  generation  or  two 
but,  in  their  persistence  and  irreversi- 
bility, deeply  troublesome  nonetheless. 

Until  recently,  science  offered  little 
hope  for  detecting  at  first  hand  these 
tiny  changes  to  the  DNA  that  encodes  our 
genetic  program.  Needed  was  a  tool  that 
could  detect  a  change  in  one  "word"  of 
the  program,  among  perhaps  a  hundred 
million.  Then,  in  1984,  at  a  meeting  convened 
jointly  by  the  DOE  and  the  International 
Commission  for  Protection  Against  Environ- 
mental Mutagens  and  Carcinogens,  the  ques- 
tion was  first  seriously  asked:  Can  we,  should 
we,  sequence  the  human  genome?  That  is, 
can  we  develop  the  technology  to  obtain  a 
word-by-word  copy  of  the  entire  genetic 
script for an "average" human being, and thus
to  establish  a  benchmark  for  detecting  the 
elusive  mutagenic  effects  of  radiation  and 
cancer-causing  toxins?  Answering  such  a 
question  was  not  simple.  Workshops  were 
convened  in  1985  and  1986;  the  issue  was 
studied  by  a  DOE  advisory  group,  by  the 
Congressional  Office  of  Technology  Assess- 
ment, and  by  the  National  Academy  of 
Sciences;  and  the  matter  was  debated  publicly 
and  privately  among  biologists  themselves.  In 
the  end,  however,  a  consensus  emerged  that 
we  should  make  a  start. 


Adding  impetus  to  the  DOE's  earliest 
interest  in  the  human  genome  was  the 
Department's  stewardship  of  the  national 
laboratories,  with  their  demonstrated  ability 
to  conduct  large  multidisciplinary  projects  — 
just  the  sort  of  effort  that  would  be  needed 
to  develop  and  implement  the  technological 
know-how  needed  for  the  Human  Genome 
Project.  Biological  research  programs  al- 
ready in  place  at  the  national  labs  benefited 
from  the  contributions  of  engineers,  physi- 
cists, chemists,  computer  scientists,  and 
mathematicians,  working  together  in  teams. 
Thus,  with  the  infrastructure  in  place  and 
with  a  particular  interest  in  the  ultimate 
results,  the  Department  of  Energy,  in  1986, 
was  the  first  federal  agency  to  announce  and 
to  fund  an  initiative  to  pursue  a  detailed 
understanding  of  the  human  genome. 

Of  course,  interest  was  not  restricted  to 
the  DOE.  Workshops  had  also  been  spon- 
sored by  the  National  Institutes  of  Health, 
the  Cold  Spring  Harbor  Laboratory,  and  the 
Howard  Hughes  Medical  Institute.  In  1988 
the  NIH  joined  in  the  pursuit,  and  in  the  fall 
of  that  year,  the  DOE  and  the  NIH  signed  a 
memorandum  of  understanding  that  laid  the 
foundation  for  a  concerted  interagency  effort. 
The  basis  for  this  community-wide  excite- 
ment is  not  hard  to  comprehend.  The  first 
impulse  behind  the  DOE's  commitment  was 
only one of many reasons for coveting a
deeper  insight  into  the  human  genetic  script. 
Defective  genes  directly  account  for  an  esti- 
mated 4000  hereditary  human  diseases  — mal- 
adies such  as  Huntington  disease  and  cystic 
fibrosis.  In  some  such  cases,  a  single  mis- 
placed letter  among  three  billion  can  have 
lethal  consequences.  For  most  of  us,  though, 
even  greater  interest  focuses  on  the  far  more 
common  ailments  in  which  altered  genes 
influence  but  do  not  prescribe.  Heart  dis- 
ease, many  cancers,  and  some  psychiatric  dis- 
orders, for  example,  can  emerge  from  compli- 
cated interplays  of  environmental  factors  and 
genetic  misinformation. 

The  first  steps  in  the  Human  Genome 
Project  are  to  develop  the  needed  technolo- 
gies, then to "map" and "sequence" the


genome.  But  in  a  sense,  these  well-publi- 
cized efforts  aim  only  to  provide  the  raw 
material  for  the  next,  longer  strides.  The  ulti- 
mate goal  is  to  exploit  those  resources  for  a 
truly  profound  molecular-level  understand- 
ing of  how  we  develop  from  embryo  to  adult, 
what  makes  us  work,  and  what  causes  things 
to  go  wrong.  The  benefits  to  be  reaped 
stretch  the  imagination.  In  the  offing  is  a 
new  era  of  molecular  medicine  characterized 
not  by  treating  symptoms,  but  rather  by 
looking  to  the  deepest  causes  of  disease. 
Rapid  and  more  accurate  diagnostic  tests  will 
make  possible  earlier  treatment  for  countless 
maladies.  Even  more  promising,  insights 
into  genetic  susceptibilities  to  disease  and  to 
environmental  insults,  coupled  with  preven- 
tive therapies,  will  thwart  some  diseases  alto- 
gether. New,  highly  targeted  pharmaceuti- 
cals, not  just  for  heritable  diseases,  but  for 
communicable  ailments  as  well,  will  attack 
diseases  at  their  molecular  foundations.  And 
even  gene  therapy  will  become  possible,  in 
some  cases  actually  "fixing"  genetic  errors. 
All  of  this  in  addition  to  a  new  intellectual 
perspective  on  who  we  are  and  where  we 
came  from. 

The  Department  of  Energy  is  proud  to 
be  playing  a  central  role  in  propelling  us 
toward  these  noble  goals. 
THE RECIPE FOR LIFE


FOR  ALL  THE  DIVERSITY  of  the 
world's  five  and  a  half  billion  peo- 
ple, full  of  creativity  and  contra- 
dictions, the  machinery  of  every 
human  mind  and  body  is  built 
and  run  with  fewer  than  100,000  kinds  of 
protein  molecules.  And  for  each  of  these  pro- 
teins, we  can  imagine  a  single  corresponding 
gene  (though  there  is  sometimes  some  redun- 
dancy) whose  job  it  is  to  ensure  an  adequate 
and  timely  supply.  In  a  material  sense,  then, 
all  of  the  subtlety  of  our  species,  all  of  our  art 
and  science,  is  ultimately  accounted  for  by  a 
surprisingly  small  set  of  discrete  genetic 
instructions.  More  surprising  still,  the  differ- 
ences between  two  unrelated  individuals, 
between  the  man  next  door  and  Mozart,  may 
reflect  a  mere  handful  of  differences  in  their 
genomic  recipes  — perhaps  one  altered  word 
in  five  hundred.  We  are  far  more  alike  than 
we  are  different.  At  the  same  time,  there  is 
room  for  near-infinite  variety. 

It  is  no  overstatement  to  say  that  to 
decode  our  100,000  genes  in  some  funda- 
mental way  would  be  an  epochal  step  toward 
unraveling  the  manifold  mysteries  of  life. 

Some   definitions 

The human genome is the full comple-
ment of  genetic  material  in  a  human  cell. 
(Despite  five  and  a  half  billion  variations  on  a 
theme,  the  differences  from  one  genome  to 
the  next  are  minute;  hence,  we  hear  about  the 
human  genome  — as  if  there  were  only  one.) 
The  genome,  in  turn,  is  distributed  among  23 
sets of chromosomes, which, in each of us, have
been   replicated  and  re-replicated  since  the 


fusion  of  sperm  and  egg  that  marked  our  con- 
ception. The  source  of  our  personal  unique- 
ness, our  full  genome,  is  therefore  preserved 
in  each  of  our  body's  several  trillion  cells.  At 
a  more  basic  level,  the  genome  is  DNA, 
deoxyribonucleic  acid,  a  natural  polymer 
built up of repeating nucleotides, each consist-
ing of  a  simple  sugar,  a  phosphate  group,  and 
one  of  four  nitrogenous  bases.  The  hierarchy 
of  structure  from  chromosome  to  nucleotide 
is  shown  in  Figure  1.  In  the  chromosomes, 
two  DNA  strands  are  twisted  together  into 
an  entwined  spiral  — the  famous  double 
helix  — held  together  by  weak  bonds  between 
complementary  bases,  adenine  (A)  in  one 
strand  to  thymine  (T)  in  the  other,  and  cyto- 
sine  to  guanine  (C-G).  In  the  language  of 
molecular  genetics,  each  of  these  linkages 
constitutes a base pair. All told, if we count
only one of each pair of chromosomes, the
human genome comprises about three billion
base pairs.

The  specificity  of  these  base-pair  link- 
ages underlies  all  that  is  wonderful  about 
DNA.  First,  replication  becomes  straightfor- 
ward. Unzipping  the  double  helix  provides 
unambiguous  templates  for  the  synthesis  of 
daughter  molecules:  One  helix  begets  two 
with  near-perfect  fidelity.  Second,  by  a  simi- 
lar template-based  process,  depicted  in 
Figure  2,  a  means  is  also  available  for  pro- 
ducing a  DNA-like  messenger  to  the  cell 
cytoplasm. There, this messenger RNA, the
faithful  complement  of  a  particular  DNA 
segment,  directs  the  synthesis  of  a  particular 
protein.  Many  subtleties  are  entailed  in  the 
synthesis  of  proteins,  but  in  a  schematic 
sense,  the  process  is  elegantly  simple. 
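
The pairing rule just described can be sketched in a few lines of Python. This is only an illustration: the function names and the eight-base example strand are invented here, and the sketch ignores the opposite chemical orientation of the two strands. It shows that unzipping a helix and rebuilding a partner for each strand yields two copies of the original.

```python
# Pairing rules stated in the text: A pairs with T, and C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base partner of a DNA strand (same reading order)."""
    return "".join(PAIRING[base] for base in strand)


def replicate(strand, partner):
    """Unzip a double helix and build a new partner on each template strand."""
    daughter_1 = (strand, complement(strand))
    daughter_2 = (complement(partner), partner)
    return daughter_1, daughter_2


# Hypothetical eight-base stretch, chosen only for the example.
helix = ("ATGCCGTA", complement("ATGCCGTA"))
daughter_1, daughter_2 = replicate(*helix)
```

Because each base admits exactly one partner, both daughters come out identical to the parent helix, which is the "one helix begets two" behavior the text describes.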
FIGURE 1. SOME DNA DETAILS. Apart from reproductive gametes, each cell of the
human body contains 23 pairs of chromosomes, each a packet of compressed and entwined DNA.
Every strand of the DNA is a huge natural polymer of repeating nucleotide units, each of which
comprises a phosphate group, a sugar (deoxyribose), and a base (either adenine, thymine, cytosine,
or guanine). Every strand thus embodies a code of four characters (A's, T's, C's, and G's), the recipe
for the machinery of human life. In its normal state, DNA takes the form of a highly regular double-
stranded helix, the strands of which are linked by hydrogen bonds between adenine and thymine (A-T)
and between cytosine and guanine (C-G). Each such linkage is said to constitute a base pair; some
three billion base pairs constitute the human genome. It is the specificity of these base-pair linkages
that underlies the mechanism of DNA replication illustrated here. Each strand of the double helix
serves as a template for the synthesis of a new strand, the nucleotide sequence of which is strictly
determined. Replication thus produces twin daughter helices, each an exact replica of its sole parent.




Every protein is made up of one or
more  polypeptide  chains,  each  a  series  of 
(typically)  several  hundred  molecules  known 
as amino acids, linked by so-called peptide
bonds.  Remarkably,  only  20  different  kinds 
of  amino  acids  suffice  as  the  building  blocks 
for  all  human  proteins.  The  synthesis  of  a 
protein  chain,  then,  is  simply  a  matter  of 
specifying  a  particular  sequence  of  amino 
acids.  This  is  the  role  of  the  messenger  RNA. 
(The  same  nitrogenous  bases  are  at  work  in 


RNA  as  in  DNA,  except  that  uracil  takes  the 
place of the DNA base thymine.) Each linear
sequence of three bases (both in RNA
and  in  DNA)  corresponds  uniquely  to  a 
single  amino  acid.  The  RNA  sequence  AAU 
thus  dictates  that  the  amino  acid  asparagine 
should  be  added  to  a  polypeptide  chain,  GCA 
specifies  alanine  — and  so  on.  A  segment  of 
the chromosomal DNA that directs the synthesis
of a single type of protein constitutes
a single gene.
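
As a rough illustration of this three-bases-per-amino-acid reading, the Python sketch below transcribes a short DNA stretch and translates it. It rests on several assumptions: the codon table is deliberately partial, the example sequence is invented, and transcription is simplified to rewriting the coding strand with uracil in place of thymine. Only the AAU and GCA assignments come from the text; the other entries are standard genetic-code assignments added for the example.

```python
# Partial codon table, for illustration only.
# AAU -> asparagine and GCA -> alanine are the examples given in the text;
# AUG, GCC, and UAA are standard assignments added here.
CODON_TABLE = {
    "AUG": "Met",
    "AAU": "Asn",
    "GCA": "Ala",
    "GCC": "Ala",
    "UAA": "STOP",
}


def transcribe(coding_dna):
    """Write the mRNA matching a DNA coding strand: uracil replaces thymine."""
    return coding_dna.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time, stopping at a stop codon."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        chain.append(residue)
    return chain


# Hypothetical 15-base gene fragment, chosen only for the example.
mrna = transcribe("ATGAATGCAGCCTAA")
protein = translate(mrna)
```

Each codon names exactly one amino acid, yet GCA and GCC both yield alanine here, so the mapping runs unambiguously in one direction only.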




The plan includes
goals for genetic
and physical
mapping, DNA
sequencing,
identifying and
locating genes,
and pursuing
further developments
in technology
and informatics.


A     PLAN     OF    ACTION 

In  1990  the  Department  of  Energy  and 
the  National  Institutes  of  Health  developed  a 
joint  research  plan  for  their  genome  pro- 
grams, outlining  specific  goals  for  the  ensu- 
ing five  years.  Three  years  later,  emboldened 
by  progress  that  was  on  track  or  even  ahead 
of  schedule,  the  two  agencies  put  forth  an 
updated  five-year  plan.  Improvements  in 
technology,  together  with  the  experience  of 
three  years,  allowed  an  even  more  ambitious 
prospect. 

In  broad  terms,  the  revised  plan 
includes  goals  for  genetic  and  physical 
mapping  of  the  genome,  DNA  sequencing, 
identifying  and  locating 
genes,  and  pursuing  further 
developments  in  technology 
and  informatics.  To  a  large 
extent,  the  following  pages 
are  devoted  to  a  discussion 
of  just  what  these  goals 
mean,  and  what  part  the 
DOE  is  playing  in  pursuing 
them.  In  addition,  the  plan 
emphasizes  the  continuing 
importance  of  the  ethical, 
legal,  and  social  implications 
of  genome  research,  and  it 
underscores  the  critical  roles 
of  scientific  training,  tech- 
nology transfer,  and  public 
access  to  research  data  and 
materials.  Most  of  the  goals 
focus  on  the  human  genome, 
but the importance of con-
tinuing research on widely
studied "model organisms" is also explicitly
recognized.

Among  the  scientific  goals  of  human 
genome  research,  several  are  especially 
notable,  as  they  provide  clear  milestones  for 
future  progress.  In  reciting  them,  however,  it 
is  important  to  note  an  underlying  assump- 
tion of  adequate  research  support.  Such  sup- 
port is  obviously  crucial  if  the  joint  plan  is  to 
succeed.  Some  of  the  central  goals  for 
1993-98  follow: 


♦  Complete  a  genetic  linkage  map  at  a  reso- 
lution of  two  to  five  centimorgans  by 
1995  — As  discussed  on  page  10,  this  goal 
was  far  surpassed  by  the  fall  of  1994. 

♦  Complete  a  physical  map  at  a  resolution 
of 100 kilobases by 1998. This implies
a genome map with 30,000 "signposts,"
separated by an average of 100,000
base pairs. Further, each signpost will be
a sequence-tagged site, a stretch of DNA
with  a  unique  and  well-defined  DNA 
sequence.  Such  a  map  will  greatly  facili- 
tate "production  sequencing"  of  the  entire 
genome.  By  the  end  of  1995,  molecular 
biologists  were  halfway  to  this  goal:  A 
physical  map  was  announced  with  15,000 
sequence-tagged  signposts.  Physical  map- 
ping is  discussed  on  pages  10-16. 

♦  By  1998  develop  the  capacity  to  sequence 
50  million  base  pairs  per  year  in  long 
continuous  segments  — Adequate  fiscal 
investment  and  continuing  progress 
beyond  1998  should  then  produce  a 
fully  sequenced  human  genome  by  the 
year  2005  or  earlier.  Sequencing  is  the 
subject of pages 16-26.

♦  Develop  efficient  methods  for  identifying 
and  locating  known  genes  on  physical 
maps  or  sequenced  DNA  — The  goals 
here  are  less  quantifiable,  but  the  aim  is 
central  to  the  Human  Genome  Project:  to 
home  in  on  and  ultimately  to  understand 
the  most  important  human  genes,  namely, 
the  ones  responsible  for  serious  diseases 
and  those  crucial  for  healthy  development 
and  normal  functions. 

♦  Pursue  technological  developments  in 
areas  such  as  automation  and  robotics  — 
A  continuing  emphasis  on  technological 
advance  is  critical.  Innovative  technolo- 
gies, such  as  those  described  on  pages 
27-30, are the necessary underpinnings of
future  large-scale  sequencing  efforts. 

♦  Continue  the  development  of  database 
tools  and  software  for  managing  and 
interpreting  genome  data— This  is  the 
area  of  informatics,  discussed  on  pages 
30-31.  The  challenge  is  not  so  much  the 
volume  of  data,  but  rather  the  need  to 
FIGURE 2. FROM GENES TO PROTEINS. In the cell nucleus, RNA is produced by transcription, in much the same way
that DNA replicates itself. RNA, however, substitutes the sugar ribose for deoxyribose and the base uracil for thymine, and is usually
single-stranded. One form of RNA, messenger RNA or mRNA, conveys the DNA recipe for protein synthesis to the cell cytoplasm.
There, bound temporarily to a cytoplasmic particle known as a ribosome, each three-base codon of the mRNA links to a specific form of
transfer RNA (tRNA) containing the complementary three-base sequence. This tRNA, in turn, transfers a single amino acid to a growing
protein chain. Each codon thus unambiguously directs the addition of one amino acid to the protein. On the other hand, the same amino
acid can be added by different codons; in this illustration, the mRNA sequence GCA and a second codon both specify the addition of the
amino acid alanine (Ala).


mount  a  system  compatible  with  re- 
searchers around the world, and one that
will  allow  scientists  to  contribute  new  data 
and  to  freely  interrogate  the  existing  data- 
bases. The  ultimate  measure  of  success 
will  be  the  ease  with  which  biologists  can 
fruitfully  use  the  information  produced  by 
the  genome  project. 
♦  Continue  to  explore  the  ethical,  legal, 
and  social  implications  of  genome 
research  — Much  emphasis  continues  to  be 
placed  on  issues  of  privacy  and  the  fair  use 
of  genetic  information.  New  goals  focus 
on  defining  additional  pertinent  issues  and 


developing  policy  responses  to  them,  dis- 
seminating policy  options  regarding 
genetic  testing  services,  fostering  greater 
acceptance  of  human  genetic  variation, 
and  enhancing  public  and  professional 
education  that  is  sensitive  to  sociocultural 
and  psychological  issues.  This  side  of 
the  genome  project  is  discussed  on 
pages  32-33. 
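
The figures quoted in the physical-mapping and sequencing goals above can be checked with simple arithmetic. The Python sketch below uses only numbers stated in the text; the variable names are invented for the illustration, and the last quantity assumes, purely for the sake of the calculation, that sequencing capacity stays frozen at the 1998 target.

```python
GENOME_BASE_PAIRS = 3_000_000_000   # "about three billion base pairs"
MAP_RESOLUTION_BP = 100_000         # physical-map goal: 100 kilobases
SIGNPOSTS_BY_1995 = 15_000          # sequence-tagged sites mapped by end of 1995
CAPACITY_BP_PER_YEAR = 50_000_000   # sequencing-capacity goal for 1998

# One signpost every 100,000 base pairs across the whole genome.
signposts_needed = GENOME_BASE_PAIRS // MAP_RESOLUTION_BP

# Share of the physical-map goal reached by the end of 1995.
fraction_mapped = SIGNPOSTS_BY_1995 / signposts_needed

# Hypothetical: years to sequence the genome if capacity never grew past 1998.
years_at_1998_capacity = GENOME_BASE_PAIRS / CAPACITY_BP_PER_YEAR
```

The first two results reproduce the 30,000 signposts and the "halfway" mark given in the text. The third comes to 60 years, which suggests why the 2005 target is tied to "continuing progress beyond 1998" and not to the 1998 capacity alone.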
MAPPING THE TERRAIN


Just as there
are topographic
maps and
political maps
and highway
maps, so there
are different
kinds of
genome maps.


ONE OF THE CENTRAL GOALS of
the Human Genome Project
is to produce a detailed "map"
of the human genome. But,
just as there are topographic
maps and political maps and highway maps of
the United States, so there are different kinds
of genome maps, the variety of which
is suggested in Figure 3. One type, a genetic
linkage  map,  is  based  on  careful  analyses 
of  human  inheritance  patterns.  It  indicates 
for  each  chromosome  the 
whereabouts  of  genes  or 
other  "heritable  markers," 
with  distances  measured  in 
centimorgans,  a  measure  of 
recombination  frequency. 
During  the  formation  of 
sperm  and  egg  cells,  a  process 
of  genetic  recombination  — or 
"crossing  over"  —  occurs  in 
which  pieces  of  genetic  mate- 
rial are  swapped  between 
paired  chromosomes.  This 
process  of  chromosomal 
scrambling  accounts  for  the 
differences  invariably  seen 
even  in  siblings  (apart  from 
identical  twins).  Logically,  the  closer  two 
genes  are  to  each  other  on  a  single  chromo- 
some, the  less  likely  they  are  to  get  split  up 
during  genetic  recombination.  When  they 
are  close  enough  that  the  chances  of  being 
separated  are  only  one  in  a  hundred,  they 
are  said  to  be  separated  by  a  distance  of 
one  centimorgan. 
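
As a rough numerical illustration of this definition, the sketch below converts an observed recombination fraction into a map distance. The function names are illustrative only. The proportional rule (1 percent recombination equals 1 centimorgan) is the one described in the text and holds only for closely spaced markers; Haldane's mapping function, which the text does not mention, is included as one commonly used correction for larger distances.

```python
import math

def centimorgans_simple(recombination_fraction):
    """Map distance under the small-distance rule: 1% recombination = 1 cM."""
    return 100.0 * recombination_fraction

def centimorgans_haldane(recombination_fraction):
    """Haldane's mapping function (assumes crossovers occur independently).

    Valid for recombination fractions in the range [0, 0.5).
    """
    if not 0.0 <= recombination_fraction < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * recombination_fraction)

# Two markers split up in 1 of every 100 cases lie about 1 cM apart.
one_percent_simple = centimorgans_simple(0.01)
one_percent_haldane = centimorgans_haldane(0.01)
```

For small fractions the two estimates nearly coincide; they diverge as the markers get farther apart, which is why the simple rule is usually reserved for tightly linked markers.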


The  role  of  human  pedigrees  now 
becomes  clear.  By  studying  family  trees  and 
tracing  the  inheritance  of  diseases  and  physi- 
cal traits,  or  even  unique  segments  of  DNA 
identifiable  only  in  the  laboratory,  geneticists 
can  begin  to  pin  down  the  relative  positions 
of these genetic markers. By the end of 1994,
a comprehensive map was available that
included  more  than  5800  such  markers, 
including  genes  implicated  in  cystic  fibrosis, 
myotonic  dystrophy,  Huntington  disease, 
Tay-Sachs  disease,  several  cancers,  and  many 
other  maladies.  The  average  gap  between 
markers  was  about  0.7  centimorgan. 

Other maps are known as physical maps,
so  called  because  the  distances  between  fea- 
tures are  measured  not  in  genetic  terms,  but 
in  "real"  physical  units,  typically,  numbers  of 
base  pairs.  A  close  analogy  can  thus  be 
drawn  between  physical  maps  and  the  road 
maps  familiar  to  us  all.  Indeed,  the  analogy 
can  be  extended  further.  Just  as  small-scale 
road  maps  may  show  only  large  cities  and 
indicate  distances  only  between  major  fea- 
tures, so  a  low-resolution  physical  map 
includes  only  a  relative  sprinkling  of  chromo- 
somal landmarks.  A  well-known  low-resolu- 
tion physical  map,  for  example,  is  the  familiar 
chromosomal  map,  showing  the  distinctive 
staining  patterns  that  can  be  seen  in  the  light 
microscope.  Further,  by  a  process  known  as 
in situ hybridization, specific segments of DNA
can  be  targeted  in  intact  chromosomes  by 
using  complementary  strands  synthesized  in 
the laboratory. These laboratory-made
"probes" carry a fluorescent or radioactive
label, which can then be detected and thus
pinpointed on a specific region of the chromosome.
Figure 4 shows some results of
fluorescence in situ hybridization (FISH).
Of particular interest are probes known as
cDNA (for complementary DNA), which are
synthesized by using molecules of messenger
RNA as templates. These molecules of
cDNA thus hybridize to "expressed" chromosomal
regions, that is, regions that directly dictate
the synthesis of proteins. However, a physical
map that depended only on in situ
hybridization would be a fairly coarse one.
Fluorescent tags on intact chromosomes cannot
be resolved into separate spots unless
they are two to five million base pairs apart.

FIGURE 3. GENOMIC GEOGRAPHY. The human genome can be
mapped in a number of ways. The familiar and reproducible banding pattern of
the chromosomes constitutes one kind of physical map, and in many cases, the
positions of genes or other heritable markers have been localized to one band or
another. More useful are genetic linkage maps, on which the relative positions of
markers have been established by studying how frequently the markers are separated
during a natural process of chromosomal shuffling called genetic recombination.
The cryptically coded ordered markers near the top of this figure are physically
mapped to specific regions of chromosome 19; some of them also constitute
a low-resolution genetic linkage map. (Hundreds of genes and other markers have
been mapped on chromosome 19; only a few are indicated here. See Figure 5 for
a display of mapped genes.) A higher-resolution physical map might describe, as
shown here, the cutting sites (the short vertical lines) for certain DNA-cleaving
enzymes. The overlapping fragments that allow such a map to be constructed are
then the resources for obtaining the ultimate physical map, the base-pair sequence
for the human genome. At the bottom of this figure is an example of output from
an automatic sequencing machine.

FIGURE 4. FISHING FOR GENES. Fluorescence
in situ hybridization (FISH)
probes are strands of DNA that
... DSRAD gene was then accurately
mapped to a narrow
region of the long arm of
chromosome 1.

Fortunately,  means  are  also  available  to 
produce  physical  maps  of  much  higher  reso- 
lution—analogous to  large-scale  county  maps 
that  show  every  village  and  farm  road,  and 
indicate  distances  at  a  similar  level  of  detail. 
Just  such  a  detailed  physical  map  is  one  that 
emerges from the use of restriction enzymes,
DNA-cleaving enzymes that serve as highly
selective microscopic scalpels (see "Tools of
the Trade," pages 17-19). A typical restriction
enzyme known as EcoRI, for example,
recognizes  the  DNA  sequence  GAATTC  and 
selectively  cuts  the  double  helix  at  that  site. 
One  use  of  these  handy  tools  involves  cutting 
up  a  selected  chromosome  into  small  pieces, 
then  cloning  and  ordering  the  resulting  frag- 
ments. The  cloning,  or  copying,  process  is  a 
product  of  recombinant  DNA  technology,  in 
which  the  natural  reproductive  machinery  of 
a  "host"  organism— a  bacterium  or  a  yeast, 
for  example  —  replicates  a  "parasitic"  frag- 
ment of  human  DNA,  thus  producing  the 
multiple  copies  needed  for  further  study  (see 
"Tools  of  the  Trade").  By  cloning  enough 
such  fragments,  each  overlapping  the  next 
and  together  spanning  long  segments  (or 
even  the  entire  length)  of  the  chromosome, 
workers  can  eventually  produce  an  ordered 
library of clones. Each contiguous block of
ordered  clones  is  known  as  a  contig  (a  small 
one  is  shown  in  Figure  3),  and  the  resulting 
map  is  a  contig  map.  If  a  gene  can  be  local- 
ized to  a  single  fragment  within  a  contig  map, 
its  physical  location  is  thereby  accurately 
pinned  down.  Further,  these  conveniently 
sized  clones  become  resources  for  further 
studies  by  researchers  around  the  world  — 
as  well  as  the  natural  starting  points  for 
systematic  sequencing  efforts. 

Two    GIANT    STEPS: 

Chromosomes   16  and   19 

One  of  the  signal  achievements  of  the 
DOE  genome  effort  so  far  is  the  successful 
physical  mapping  of  chromosomes  16  and  19. 
The  high-resolution  chromosome  19  map, 
constructed  at  the  Lawrence  Livermore 
National  Laboratory,  is  based  on  restriction 
fragments cloned in cosmids, synthetic cloning
"vectors"  modeled  after  bacteria-infecting 
viruses  known  as  bacteriophages.  Like  a 
phage,  a  cosmid  hijacks  the  cellular  machin- 
ery of  a  bacterium  to  mass-produce  its  own 
genetic  material,  together  with  any  "foreign" 
human  DNA  that  has  been  smuggled  into  it. 
The  foundation  of  the  chromosome  19  map  is 
a large set of cosmid contigs that were assembled
by automated analysis of overlapping
but unordered restriction fragments. These
contigs  span  an  estimated  54  million  base 
pairs,  more  than  95  percent  of  the  chromo- 
some, excluding  the  centromere. 

Most  of  the  contigs  have  been  mapped 
by  fluorescence  in  situ  hybridization  to  visi- 
ble chromosomal  bands.  Further,  more  than 
200  cosmids  have  been  more  accurately 
ordered  along  the  chromosome  by  a  high-res- 
olution FISH  technique  in  which  the  dis- 
tances between  cosmids  are  determined  with 
a  resolution  of  about  50,000  base  pairs.  This 
ordered  FISH  map,  with  cosmid  reference 
points  separated  by  an  average  of  230,000 
base  pairs,  provides  the  essential  framework 
to  which  other  cosmid  contigs  can  be 
anchored. Moreover, the EcoRI restriction
sites  have  been  mapped  on  more  than  45  mil- 
lion base  pairs  of  the  overall  cosmid  map. 
Over  450  genes  and  genetic  markers  have 
also  been  localized  on  this  map,  of  which 
nearly  300  have  been  incorporated  into  the 
ordered  map.  Figure  5  shows  the  locations  of 
the  mapped  genes.  Among  these  genes  is  the 
one  responsible  for  the  most  common  form  of 
adult  muscular  dystrophy  (DM),  which  was 
identified  in  1992  by  an  international  consor- 
tium that  included  Livermore  scientists. 
A  second  important  disease  gene  (COMP), 
responsible  for  a  form  of  dwarfism  known 
as  pseudoachondroplasia,  has  also  been  iden- 
tified. And  yet  another  gene,  one  linked  to  a 
form  of  congenital  kidney  disease,  has  been 
localized  to  a  single  contig  spanning  one 
million  base  pairs,  but  has  not  yet  been 
precisely  pinpointed.  About  2000  other 
genes  are  likely  to  be  found  eventually  on 
chromosome  19. 

In  a  similar  effort,  the  Los  Alamos 
National  Laboratory  Center  for  Human 
Genome  Studies  has  completed  a  highly  inte- 
grated map  of  chromosome  16,  a  chromo- 
some that  contains  genes  linked  to  blood  dis- 
orders, a  second  form  of  kidney  disease, 
leukemia,  and  breast  and  prostate  cancers. 
A  readable  display  of  this  integrated  map 
covers  a  sheet  of  paper  more  than  15  feet 
long;  a  portion  of  it,  much  reduced  and 
showing  only  some  of  its  central  features,  is 
reproduced  here  as  Figure  6.  The  framework 


for  the  Los  Alamos  effort  is  yet  another  kind 
of  map,  a  "cytogenetic  breakpoint  map" 
based  on  78  lines  of  cultured  cells,  each  a 
hybrid  that  contains  mouse  chromosomes 
and  a  fragment  of  human  chromosome  16. 
Natural  breakpoints  in  chromosome  16  are 
thus  identified,  leading  to  a  breakpoint  map 
that  divides  the  chromosome  into  segments 
whose  lengths  average  1.1  million  base  pairs. 
Anchored  to  this  framework  are  a  low-reso- 
lution contig  map  based  on  YAC  clones  and  a 
high-resolution  contig  map  based  largely  on 
cosmids  (for  more  on  YACs,  yeast  artificial 
chromosomes, see "Tools of the Trade," pages
17-19).  The  low-resolution  map,  comprising 
700  YACs  from  a  library  constructed  by  the 
Centre  d'Etude  du  Polymorphisme  Humain 
(CEPH),  provides  practically  complete  cov- 
erage of  the  chromosome,  except  the  highly 
repetitive  DNA  in  the  centromere  region. 
The  high-resolution  map  comprises  some 
4000  cosmid  clones,  assembled  into  about 
500  contigs  covering  60  percent  of  the  chro- 
mosome. In  addition,  it  includes  250  smaller 
YAC  clones  that  have  been  merged  with  the 
cosmid contig map. The cosmid contig map
is an especially important step forward, since
it is a "sequence-ready" map. It is based
on bacterial clones that are ideal substrates
for DNA sequencing, and, further, the
clones have been restriction
mapped to allow identification
of a minimum set of overlapping
clones for a large-scale
sequencing effort.

FIGURE 6. MAPPING CHROMOSOME 16.

The  high-  and  low-resolu- 
tion maps  have  been  tied 
together  by  sequence-tagged 
sites  (STSs),  short  but  unique 
stretches  of  DNA  sequence. 
They  have  also  been  integrated 
into  the  breakpoint  map,  and 
with  genetic  maps  developed 
at the Adelaide Children's
Hospital and by CEPH. The
integrated map also includes a
transcription map of 1000
sequenced exons (expressed
fragments of genes) and more
than  600  other  markers  developed  at  other 
laboratories  around  the  world. 

Getting  down  to  details: 
Sequencing  the  genome 

Ultimately,  though,  these  physical  maps 
and  the  clones  they  point  to  are  mere  step- 
ping stones  to  the  most  visible  goal  of  the 
genome  project,  the  string  of  three  billion 
characters (A's, T's, C's, and G's) representing
the sequence of base pairs that defines
our  species.  Included,  of  course,  would  be 
the  sequence  for  every  gene,  as  well  as  the 
sequences  for  stretches  of  DNA  whose  func- 
tions we  don't  yet  know  (but  which  may  be 
involved  in  such  little-understood  processes 
as  orchestrating  gene  expression  in  different 
parts  of  our  bodies,  at  different  times  of  our 
lives).  Should  anyone  undertake  to  print  it 
all  out,  the  result  would  fill  several  hundred 
volumes  the  size  of  a  big-city  phone  book. 

Only  the  barest  start  has  been  made  in 
taking  this  dramatic  step  in  the  Human 
Genome  Project.  Several  hundred  million 
base  pairs  have  been  sequenced  and  archived 
in  databases,  but  the  great  majority  of  these 


are  from  short  "sequence  tags"  on  cloned 
fragments.  Only  about  30  million  base  pairs 
of  human  DNA  (roughly  one  percent  of 
the  total)  have  been  sequenced  in  longer 
stretches,  the  longest  being  about  685,000 
base  pairs  long.  Even  more  daunting  is  the 
realization  that  we  will  eventually  need  to 
sequence  many  parts  of  the  genome  many 
times,  thus  to  reveal  differences  that  indicate 
various  forms  of  the  same  gene. 

Hence,  as  with  so  many  human  enter- 
prises, the  challenge  of  sequencing  the 
genome  is  largely  one  of  doing  the  job 
cheaper  and  faster.  At  the  beginning  of  the 
project,  the  cost  of  sequencing  a  single  base 
pair  was  between  $2  and  $10,  and  one 
researcher  could  produce  between  20,000 
and  50,000  base  pairs  of  continuous,  accurate 
sequence  in  a  year.  Sequencing  the  genome 
by  the  year  2005  would  therefore  likely  cost 
$10-20  billion  and  require  a  dedicated  cadre 
of  at  least  5000  workers.  Clearly,  a  major 
effort  in  technology  development  was  called 
for— an  effort  that  would  drive  the  cost  well 
below  $1  per  base  pair  and  that  would  allow 
automation  of  the  sequencing  process.  From 
the  beginning,  therefore,  the  DOE  has 
emphasized programs to pave the way for
expeditious  and  economical  sequencing 
efforts  — programs  to  develop  new  technolo- 
gies, including  new  cloning  vectors,  and  to 
establish  suitable  resources  for  sequencing, 
including  clone  libraries  and  libraries  of 
expressed  sequences. 
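
The scale of the problem can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes a single pass over three billion base pairs, with no repeat sequencing and no overhead, and simply multiplies out the per-base cost and per-researcher output quoted above. The variable and function names are hypothetical.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # "three billion characters" in the text

def cost_range(bases, low_per_base, high_per_base):
    """Total cost in dollars at the low and high per-base prices."""
    return bases * low_per_base, bases * high_per_base

def researcher_years(bases, bases_per_researcher_year):
    """Years of work for one researcher at the given annual output."""
    return bases / bases_per_researcher_year

low_cost, high_cost = cost_range(GENOME_BASE_PAIRS, 2, 10)
fast_years = researcher_years(GENOME_BASE_PAIRS, 50_000)
slow_years = researcher_years(GENOME_BASE_PAIRS, 20_000)

print(f"cost: ${low_cost / 1e9:.0f}-{high_cost / 1e9:.0f} billion")
print(f"effort: {fast_years:,.0f}-{slow_years:,.0f} researcher-years")
```

Spread over a project lasting roughly 15 years, 60,000 to 150,000 researcher-years works out to some 4,000 to 10,000 full-time workers, and the naive cost range of $6-30 billion brackets the $10-20 billion estimate quoted above.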

Efforts  to  develop  new  cloning  vectors 
have been especially productive. YACs
remain  a  classic  tool  for  cloning  large 
fragments  of  human  DNA,  but  they  are  not 
perfect.  Some  regions  of  the  genome,  for 
example,  resist  cloning  in  YACs,  and  others 
are  prone  to  rearrangement.  New  vectors 
such  as  bacterial  artificial  chromosomes 
(BACs), P1 phages, and P1-derived artificial
cloning  systems  (PACs)  have  thus  been 
devised  to  address  these  problems.  These 
new  approaches  are  critical  for  ensuring 
that  the  entire  genome  can  be  faithfully 
represented  in  clone  libraries,  without  the 
danger  of  deletions,  rearrangements,  or 
spurious insertions.

Over the next decade, as molecular
biologists tackle the task of
sequencing the human genome
on a massive scale, any number
of innovations can be expected in
mapping and sequencing technologies. But several
of the central tools of molecular genetics are likely to
stay with us, much improved perhaps, but not fundamentally
different. One such tool is the class of
DNA-cutting proteins known as restriction enzymes.
These enzymes, the first of which were discovered in
the late 1960s, cleave double-stranded DNA molecules
at specific recognition sites, usually four or six
nucleotides long. For example, a restriction enzyme
called EcoRI recognizes the single-strand sequence
GAATTC and invariably cuts the double helix as
shown in the illustration on the right.

When digested with a particular restriction
enzyme, then, identical segments of human DNA
yield identical sets of restriction fragments. On the
other hand, DNA from the same genomic region of
two different people, with their subtly different
genomic sequences, can yield dissimilar sets of fragments,
which then produce different patterns when
sorted  according  to  size. 
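
A minimal sketch of this idea follows. It treats DNA as a plain string, looks only at one strand (the EcoRI site GAATTC reads the same on both strands), and cuts between the G and the first A of each site. The two short "people" are invented sequences differing by a single base, not real data.

```python
ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # the cut falls between G and AATTC

def digest(sequence, site=ECORI_SITE, cut_offset=ECORI_CUT_OFFSET):
    """Return the fragments made by cutting a linear sequence at every site."""
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments

def fragment_sizes(sequence):
    """Fragment lengths sorted by size, as a gel would sort them."""
    return sorted(len(fragment) for fragment in digest(sequence))

# Hypothetical sequences: person_b differs by one base, which destroys a site.
person_a = "ATTGAATTCCGGATAGAATTCTTAC"
person_b = "ATTGAATTCCGGATAGAGTTCTTAC"

sizes_a = fragment_sizes(person_a)
sizes_b = fragment_sizes(person_b)
```

Here the first sequence is cut twice and yields three fragments, while the second is cut once and yields two, so the two size patterns differ even though the sequences differ at only one position.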

This leads directly to discussion of a second
essential tool of modern molecular genetics, gel
electrophoresis, for it is by electrophoresis that DNA
fragments of different sizes are most often separated.
In classical gel electrophoresis, electrically charged
macromolecules are caused to migrate through a
polymeric gel under the influence of an imposed static
electric field. In time the molecules sort themselves
by size, since the smaller ones move more
rapidly through the gel than do larger ones. In 1984
a further advance was made with the invention of
pulsed-field gel electrophoresis, in which the
strength and direction of the applied field is varied
rapidly, thus allowing DNA strands of more than
50,000 base pairs to be separated.

A third necessary tool is some means of
DNA "amplification." The classic example is the
cloning vector, which may be circular DNA molecules
derived from bacteria or from bacteriophages
(viruslike parasites of bacteria), or artificial chromosomes
constructed from yeast
or bacterial genomic DNA. The
characteristic all these vectors
share is that fragments of "foreign"
DNA can be inserted into them,
whereby the inserted DNA is
replicated along with the rest of
the vector as the host reproduces
itself. A yeast artificial chromosome,
or YAC, for instance, is
constructed by assembling the
essential functional parts of a natural
yeast chromosome (DNA
sequences that initiate replication,
sequences that mark the ends of
the chromosomes, and sequences
required for chromosome separation
during cell division), then splicing in a fragment
of human DNA. This engineered chromosome
is then reinserted into a yeast cell, which
reproduces the YAC during cell division, as if it
were part of the yeast's normal complement of
chromosomes. The result is a colony of yeast cells,
each containing a copy, or clone, of the same
fragment of human DNA. One of the important
achievements of the Human Genome Project has
been to establish several libraries of such cloned
fragments, using several different vectors (bacterial
artificial chromosomes, P1 phages, and P1-derived
cloning systems), that cover the entire human
genome.

Another way of amplifying DNA is the polymerase
chain reaction, or PCR. This enzymatic
replication technique requires that initiators, or
PCR primers, be attached as short complementary
strands at the ends of the separated DNA fragments
to be replicated. An enzyme then completes the
synthesis of the complementary strands, thus doubling
the amount of DNA originally present. Again
and again, the strands can be separated and the
polymerase reaction repeated, so effectively, in
fact, that DNA can be amplified by 100,000-fold in
less than three hours. As with cloning vectors, the
result is a large collection of copies of the original
DNA fragment.
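
Because each round of the reaction doubles the DNA, the number of rounds needed for a given amplification follows from a base-2 logarithm. The sketch below assumes perfect doubling in every cycle, an idealization; the optional efficiency parameter is a hypothetical knob for imperfect cycles and is not a figure from the text.

```python
import math

def cycles_needed(fold_amplification):
    """Smallest number of perfect doubling cycles that reaches the target."""
    return math.ceil(math.log2(fold_amplification))

def fold_after(cycles, efficiency=1.0):
    """Fold amplification after a number of cycles.

    efficiency=1.0 means every strand is copied in every cycle;
    lower values model a hypothetical per-cycle yield.
    """
    return (1.0 + efficiency) ** cycles

cycles = cycles_needed(100_000)
```

Under this idealization, 17 cycles suffice for 100,000-fold amplification, since 16 doublings give only 65,536 copies while 17 give 131,072.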

When a clone library can be ordered (that
is, when the relative positions on the human chromosomes
can be established for all the fragments),
one then has the perfect resource for achieving the
project's central goal, sequencing the human
genome. How the sequencing is actually done can
be illustrated by the most popular method in current
use, the Sanger procedure, which is depicted
schematically on the facing page. The first step is
to prime each identical DNA strand in a preparation
of cloned fragments. The preparation is then
divided into four portions, each of which contains
a different reaction-terminating nucleotide,
together with the usual reagents for replication. In
one batch, the replication reaction always produces
complementary strands that end with A; in
another, with C; and so on. Gel electrophoresis is
used to sift the resulting products according to size,
allowing one to infer the exact nucleotide
sequence for the original DNA strand.
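
The logic of reading a sequence from four size-sorted batches can be sketched in a few lines. This is an idealized toy, not a model of the chemistry: it assumes every possible terminated chain is produced and detected, it ignores the primer, and it reads out the newly synthesized strand (the complement of the template). All names are illustrative.

```python
BASES = "ACGT"

def terminated_lengths(synthesized, base):
    """Lengths of all chains that stop at an occurrence of `base`."""
    return [i + 1 for i, b in enumerate(synthesized) if b == base]

def run_four_reactions(synthesized):
    """One batch per terminating nucleotide, as in the four portions."""
    return {base: terminated_lengths(synthesized, base) for base in BASES}

def read_gel(lanes):
    """Sort every band by size and note which batch each band came from."""
    bands = [(length, base)
             for base, lengths in lanes.items()
             for length in lengths]
    return "".join(base for _, base in sorted(bands))

lanes = run_four_reactions("GATTACA")
recovered = read_gel(lanes)
```

Each chain length appears in exactly one batch, so sorting all the bands by size and noting the batch of origin spells out the strand one base at a time.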

Marked progress is also evident in the
development of sequencing technologies,
though all of those in widespread current use
are  still  based  on  methods  developed  in  1977 
by  Allan  Maxam  and  Walter  Gilbert  and  by 
Frederick  Sanger  and  his  coworkers  (see 
"Tools  of  the  Trade,"  pages  17-19).  Both  of 
these  methods  rely  on  gel-based  elec- 
trophoresis systems  to  separate  DNA  frag- 
ments, and  recent  advances  in  commercial 
systems  include  increasing  the  number  of  gel 
lanes,  decreasing  run  times,  and  enhancing 
the  accuracy  of  base  identification.  As  a 
result  of  such  improvements,  a  standard 
sequencing  machine  can  now  turn  out  raw, 
unverified sequences of 50,000 to 75,000
bases  per  day. 

Equally  important  to  the  sequencing 
goals  of  the  genome   project   is  a  rational 
system  for  organizing  and  distributing  the 
material  to  be  sequenced.    The  DOE's  com- 
mitment  to   such    resources   dates   back   to 
1983, when it organized
the    National    Laboratory 
Gene      Library      Project. 
Based  on  cell-  and  chromo- 
some-sorting   technologies 
developed    at    Livermore 
and  Los  Alamos,  libraries 
of  clones  were  established 
for    each    of   the    human 
chromosomes,  and  the  indi- 
vidual   clones   are   widely 
available  for  mapping  and 
for  isolating  genes.    These 
clones   were  invaluable   in 
such  notable  "gene  hunts"  as  the  successful 
searches     for     the     cystic     fibrosis     and 
Huntington  disease  genes.  More  recently,  as 
more efficient vectors have become available,
complete human DNA libraries have been
established using BACs, PACs, and YACs.

Another  critical  resource  is  being 
assembled  in  an  effort  known  as  I.M.A.G.E. 
(Integrated  Molecular  Analysis  of  Genomes 
and  their  Expression),  cofounded  by  the 
Livermore  Human  Genome  Center.  The  aim 
is  a  master  set  of  mapped  and  sequenced 
human  cDNA,  representing  the  expressed 
parts  of  the  human  genome.    By  early  1996, 




I.M.A.G.E.  had  distributed  over  250,000 
partial  and  complete  cDNA  clones,  most  of 
them  with  one  or  both  ends  sequenced  to 
provide  unique  identifiers.  These  identifiers, 
expressed sequence tags (ESTs), are usually
300-500  base  pairs  each.  Twenty-five  hun- 
dred genes  have  also  been  newly  mapped  as 
part  of  this  coordinated  effort. 

Shotguns   and  transposons 

Such  advances  as  these,  in  both  tech- 
nology development  and  the  assembly  of 
resource  libraries,  have  brought  much  nearer 
the  day  when  "production  sequencing"  can 
begin.  A  great  deal  of  variety  remains,  how- 
ever, in  the  approaches  available  to  sequenc- 
ing the  human  genome,  and  it  is  not  yet  clear 
which  will  prove  the  most  efficient  and  most 
cost-effective  way  to  read  long  stretches  of 
DNA  over  the  next  decade.  One  of  the  avail- 
able choices,  for  example,  is  between  "shot- 
gun" and  "directed"  strategies.  Another  is 
the  degree  of  redundancy  — that  is,  how 
many  times  must  a  given  strand  be  sequenced 
to  ensure  acceptable  confidence  in  the  result? 

Shotgun  sequencing  derives  its  name 
from  the  randomly  generated  DNA  frag- 
ments that  are  the  objects  of  scrutiny.  Many 
copies  of  a  single  large  clone  are  broken  into 
pieces  of  perhaps  1500  base  pairs,  either  by 
restriction  enzymes  or  by  physical  shearing. 
Each fragment is then separately cloned, and
a  convenient  portion  of  it  sequenced.  A  com- 
putational assembly  process  then  compares 
the  terminal  sequences  of  the  many  frag- 
ments and,  by  finding  overlaps  that  indi- 
cate neighboring  fragments,  constructs  an 
ordered  library  for  the  parent  clone.  The 
members  of  this  ordered  library  can  then  be 
sequenced  from  end  to  end  to  yield  a  com- 
plete sequence  for  the  parent.  The  statistics 
involved  in  taking  this  approach  require  that 
many  copies  of  the  original  clone  be 
randomly  fragmented,  if  no  gaps  are  to  be 
tolerated  in  the  final  sequence.  A  benefit  is 
that  the  final  sequence  is  highly  reliable;  the 
main  disadvantage  is  that  the  same  sequence 
must  be  done  many  times  (in  the  many  over- 
lapping fragments).     Nevertheless,  shotgun 




sequencing  has  been  the  primary  means  for 
generating  most  of  the  genomic  sequence 
data  in  public  DNA  databases.  This  includes 
the  longest  contiguous  fragment  of  se- 
quenced human  DNA,  from  the  human 
T-cell  receptor  beta  region,  of  about  685,000 
base  pairs  — a  product  of  DOE-supported 
work  at  the  University  of  Washington. 
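
The computational assembly step can be illustrated with a toy program. The sketch below is a simplification and not the procedure used by any of the laboratories named here: it skips the intermediate ordered library and simply merges, again and again, the two fragments that share the longest end-to-end overlap. The parent sequence and the four reads are invented, and this greedy rule can be fooled by repeated sequences.

```python
def overlap(left, right, min_length):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    # Drop duplicates and fragments wholly contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    contigs = [f for f in unique
               if not any(f != g and f in g for g in unique)]
    while len(contigs) > 1:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: a gap in coverage remains
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

# Hypothetical 24-base "parent clone" and four overlapping reads from it.
parent = "ATGCGTACGTTAGCCTAGGATCCA"
reads = [parent[0:10], parent[6:16], parent[12:22], parent[18:24]]
assembled = greedy_assemble(reads)
```

With these four reads the parent is recovered as a single contig. If the fragments fail to overlap, the function returns several contigs instead, which corresponds to the gaps the text warns about when too few copies are fragmented.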

The  shotgun  strategy  is  also  being  used 
at  the  Genome  Therapeutics  Corporation  and 
The  Institute  for  Genomic  Research  (TIGR), 
as  part  of  the  DOE-supported  Microbial 
Genome  Initiative.  Genome  Therapeutics 
has sequenced 1.8 million base pairs of
Methanobacterium thermoautotrophicum, a bacterium
important in energy production and
bioremediation, and TIGR has successfully
sequenced the complete genomes of three
free-living bacteria, Haemophilus influenzae
(1,830,137 base pairs; an effort supported
mostly by private funds), Mycoplasma genitalium
(580,070 base pairs), and Methanococcus
jannaschii (1,739,933 base pairs).

The  alternative  to  shotgun  sequencing 
is  a  directed  approach,  in  which  one  seeks  to 
sequence  the  target  clone  from  end  to  end 
with  a  minimum  of  duplication.  The  essence 
of  this  approach  is  embodied  in  a  technique 
known as primer walking. Starting at one end
of  a  single  large  fragment,  one  replicates  a 
stretch  of  DNA— say,  400  base  pairs  long- 
that  can  be  sequenced  in  one  run.  With  the 
sequence  for  this  first  segment  in  hand,  the 
next  stretch  of  DNA,  just  overlapping  the 
first, is then tackled in the same way. In principle,
one can thus "walk" the entire length of
the  original  clone.  Unfortunately,  this  con- 
ceptually simple  approach  has  been  histori- 
cally beset  with  disadvantages,  mainly  the 
expense  and  inconvenience  of  custom- 
synthesizing  a  primer  as  the  necessary  start- 
ing point  for  each  sequencing  step.  The 
widely  automated  Sanger  sequencing  method 
involves  a  DNA  replication  step  that  must  be 
"primed"  by  a  DNA  fragment  that  is  comple- 
mentary to  15  to  20  base  pairs  of  the  strand  to 
be  sequenced  (see  "Tools  of  the  Trade,"  pages 
17-19).  Until  recently,  making  these  primers 
was an expensive and time-consuming business,
but recent innovations have made
primer walking, and similar directed strategies,
more and more economically feasible.

One  way  to  deal  with  the  primer  bottle- 
neck, for  example,  is  to  use  sets  of  very  short 
fragments  to  prime  the  next  sequencing  step. 
As  an  illustration,  the  four  nucleotides  (A,  T, 
C,  and  G)  can  be  ordered  in  more  than  68  bil- 
lion ways  to  create  an  18-base  primer,  an 
imposing  set  of  possibilities.  But  it  is  emi- 
nently practical  to  create  a  library  of  the  4096 
possible 6-base primers. Three of these
"6-mers" can be matched to the end of the
fragment to be sequenced, thus serving as an
18-base primer. This modular primer technology,
developed at the Brookhaven
National  Laboratory,  is  currently  being 
applied  to  Borrelia  burgdorferi,  the  organism 
that  causes  Lyme  disease;  a  34,000-base-pair 
fragment  has  already  been  sequenced. 
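
The counting behind the modular primer idea is easy to verify. The sketch below builds the complete library of 6-base primers and splits an 18-base priming site into three adjacent modules. The 18-base example is invented, and the code ignores the requirement that real primers be complementary to the strand being sequenced.

```python
from itertools import product

BASES = "ATCG"

def count_primers(length):
    """Number of distinct primers of a given length."""
    return len(BASES) ** length

def hexamer_library():
    """All possible 6-base primers."""
    return {"".join(p) for p in product(BASES, repeat=6)}

def modular_primer(target_18mer, library):
    """Split an 18-base priming site into three adjacent 6-mers."""
    if len(target_18mer) != 18:
        raise ValueError("expected an 18-base sequence")
    modules = [target_18mer[i:i + 6] for i in range(0, 18, 6)]
    if not all(module in library for module in modules):
        raise ValueError("sequence contains characters other than A, T, C, G")
    return modules

library = hexamer_library()
modules = modular_primer("GATTACAGGCTTAACCGT", library)
```

A stock of 4096 short primers is enough to cover any of the 68,719,476,736 possible 18-base sites, which is the economy the technique relies on.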

Another directed approach uses a naturally
occurring genetic element called a transposon,
which insinuates itself more or less randomly
in longer DNA strands. This predilection
for random insertion and the fact
that  the  transposon's  DNA  sequence  is  well 
known  are  the  keys  to  the  sequencing 
strategy depicted schematically in Figure 7.
The  largest  clones  are  broken  into  smaller 
subclones  (each  of  about  3000  base  pairs), 
which  then  become  the  targets  of  the  trans- 
posons.  Multiple  copies  of  each  subclone  are 
exposed  to  the  transposons,  and  reaction 
conditions are controlled to
yield, on average, a single
insertion in each 3000-base-pair
strand. The individual
strands  are  then  analyzed  to 
yield,  for  each,  the  approxi- 
mate position  of  the  inserted 
transposon.  By  mapping 
these  positions,  a  "minimum 
tiling  path"  can  be  deter- 
mined for  each  subclone  — 
that  is,  a  set  of  strands  can  be 
identified  whose  transposon 
insertions  are  roughly  300 
base  pairs  apart.  In  this  set 
of  strands,  the  region  around 
each  transposon  is  then  sequenced,  using  the 
inserted  transposons  as  starting  points.  The 
known  transposon  sequence  allows  a  single 
primer  to  be  used  for  sequencing  the  full  set 
of  overlapping  regions. 
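
Choosing a minimum tiling path can be sketched as a simple selection problem. The version below is one possible formulation and is not taken from the Berkeley work: it assumes the mapped insertion positions are known exactly, that successive starting points may be at most a fixed distance apart (300 base pairs here), and it uses a greedy rule that always jumps to the farthest usable insertion. The positions are invented.

```python
def minimum_tiling_path(insertions, subclone_length, max_gap):
    """Pick the fewest insertion positions whose spacing never exceeds max_gap.

    The first pick must lie within max_gap of the start of the subclone
    and the last within max_gap of its end. Raises ValueError when the
    available insertions leave a stretch that cannot be bridged.
    """
    positions = sorted(set(insertions))
    path = []
    current = 0
    while current + max_gap < subclone_length:
        reachable = [p for p in positions if current < p <= current + max_gap]
        if not reachable:
            raise ValueError(f"no insertion within {max_gap} bp after {current}")
        current = max(reachable)  # greedy: jump as far as possible
        path.append(current)
    return path

# Hypothetical mapped insertion sites in one 3000-base-pair subclone.
mapped = [100, 280, 450, 570, 800, 860, 1150, 1300,
          1440, 1730, 1900, 2020, 2310, 2500, 2600, 2890]
path = minimum_tiling_path(mapped, subclone_length=3000, max_gap=300)
```

Of the sixteen strands in this example, ten are enough to span the subclone, and only those ten would need to be sequenced.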

At  the  Lawrence  Berkeley  National 
Laboratory,  this  technique  has  been  used  to 
sequence over 1.5 million base pairs of DNA
on human chromosomes 5 and 20, as well as
over three million base pairs from the fruit fly
Drosophila melanogaster. On chromosome 5,
interest  focuses  on  a  region  of  three  million 
base  pairs  that  is  rich  in  growth  factor  and 
receptor genes, whereas on chromosome 20,
Berkeley researchers are interested in a
region  of  about  two  million  base  pairs  that  is 
implicated  in  15  to  20  percent  of  all  primary 
breast  carcinomas.  As  an  example  of  the 
kind  of  output  these  efforts  produce,  Figure 
8  shows  a  stretch  of  sequence  data  from  chro- 
mosome 5. 

Researchers  supported  by  the  DOE  at 
the  University  of  Utah  are  also  pursuing  the 
use  of  directed  sequencing.  In  addition,  they 
have  developed  a  methodology  for  "multi- 
plex" DNA  sequencing,  which  offers  a  way 
of  increasing  throughput  with  either  shot- 
gun or  directed  approaches.  By  attaching  a 
unique identifying sequence to each sequencing
sample in a mixture of, say, 50 such samples,
the entire mixture can be analyzed in a
single  electrophoresis  lane.  The  50  samples 
can  be  resolved  sequentially  by  probing,  first, 
for  bands  containing  the  first  identifier,  then 
for  bands  containing  the  second,  and  so 
forth.  In  a  similar  way,  multiplexing  can  also 
be  used  for  mapping.  The  Utah  group  is  now 
able  to  map  almost  5000  transposons  in  a  sin- 
gle experiment,  and  they  are  using  multiplex- 
ing in  concert  with  a  directed  sequencing 
strategy  to  sequence  the  1.8  million  base 
pairs of the thermophilic microbe Pyrococcus
furiosus and two important regions of human
chromosome  17. 
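
The sequential probing idea behind multiplexing can be sketched in a few lines. The model below is a deliberate simplification and an assumption of this example: each band in the shared lane is represented as a (fragment length, identifier, terminal base) tuple, and "probing" is modeled as filtering on the identifier. The tag names and sequences are hypothetical.

```python
def resolve_lane(bands, tags):
    """Read each sample's sequence from one shared lane, one probe at a time.

    bands: list of (fragment_length, tag, terminal_base) tuples, all mixed
    together as they would be in a single electrophoresis lane.
    tags: identifiers to probe for, in order.
    """
    reads = {}
    for tag in tags:
        # Probing for one identifier reveals only that sample's bands.
        visible = [band for band in bands if band[1] == tag]
        # Reading the ladder in order of fragment length gives the bases
        # in sequence order.
        visible.sort(key=lambda band: band[0])
        reads[tag] = "".join(base for _, _, base in visible)
    return reads


def make_bands(tag, sequence):
    """Build the ladder of bands one tagged sample would contribute."""
    return [(i + 1, tag, base) for i, base in enumerate(sequence)]


# Two hypothetical samples mixed into one lane.
lane = make_bands("tag01", "GATTACA") + make_bands("tag02", "CCGTA")
lane.reverse()  # the order of bands in the list carries no information
reads = resolve_lane(lane, ["tag01", "tag02"])
print(reads)
```

The same lane is reused for every probe, which is where the throughput gain comes from: one electrophoresis run serves all of the tagged samples.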

The  completed  physical  maps  of  chro- 
mosomes 16  and  19,  with  their  extensive 
coverage  in  many  different  kinds  of  cloning 
vectors,  are  especially  ripe  for  large-scale 
sequencing.  Los  Alamos  scientists  have 
therefore  begun  sequencing  chromosome  16, 
focusing  special  effort  on  locating  the  esti- 
mated 3000  expressed  genes  on  that  chromo- 
some and  using  those  sites  as  starting  points 
for  directed  genomic  sequencing.  A  region  of 
60,000  base  pairs  has  already  been  sequenced 
around  the  adult  polycystic  kidney  gene,  and 
good  starts  have  been  made  in  mapping  other 
genes.  Interestingly,  even  random  sequenc- 
ing has  led  to  the  identification  of  gene  DNA 
in  over  15  percent  of  the  samples,  confirming 
the  apparent  high  density  of  genes  on  this 
chromosome.  Between  chromosome  16  and 
the  short  arm  of  chromosome  5,  another 
Los  Alamos  target,  the  genome  center  there 
has produced almost two million base pairs of
human  DNA  sequence. 

A  parallel  effort  is  under  way  at 
Livermore  on  chromosome  19  and  other  tar- 
geted genomic  regions.  Using  a  shotgun 
approach,  researchers  there  have  completed 
over  1.3  million  bases  of  genomic  sequence. 
Initially,  they  are  attacking  two  major  regions 
of  chromosome  19:  one  of  about  two  million 
base  pairs,  containing  several  genes  involved 
in  DNA  repair  and  replication,  and  another 
of  approximately  one  million  base  pairs, 
containing  a  kidney  disease  gene.  The 
Livermore  scientists  are  making  use  of  the 
I.M.A.G.E. cDNA resource to sequence the
cDNA  from  these  regions,  along  with  the 
associated  segments  of  the  genome.  In  addi- 
tion, Livermore  scientists  have  targeted 
DNA  repair  gene  regions  throughout  the 
genome  and,  in  many  cases,  have  done  com- 
parative sequencing  of  these  genes  in  other 

The Mighty Mouse

The human genome is not so very
different from that of chimpanzees or
mice, and it even shares many common
elements with the genome of the
lowly fruit fly. Obviously, the differences
are critical, but so are the similarities. In
particular, genetic experiments on other organisms
can illuminate much that we could not otherwise
learn about homologous human genes, that
is, genes that are basically the same in the
two species.

In some cases, the connection between a
newly identified human gene and a known health
disorder can be quickly established. More often,
however, clear links between cloned genes and
human hereditary diseases or disease susceptibilities
are extremely elusive. Diseases that are modified
by other genetic predispositions, for example,
or by environment, diet, and lifestyle can be
exceedingly difficult to trace in human families.
The same holds for very rare diseases and for
genetic factors contributing to birth defects and
other developmental disorders. By contrast, disorders
such as these can sometimes be followed
relatively easily in animal systems, where uniform
genetic background and controlled breeding
schemes can be used to avoid the variability that
often confounds human population studies. As a
consequence, researchers looking for clues to the
causes of many complex health problems are
focusing more and more attention on model animal
systems.

Among such systems, which range in complexity
from yeast and bacteria to mammals, the
most prominent is the mouse. Because of its small
size, high fertility rate, and experimental
manipulability, the mouse offers great promise in studying
the genetic causes and pathological progress of
ailments, as well as understanding the genetic role
in disease susceptibility. In pursuing such studies,
the DOE is exploiting several resources, among
them the experimental mouse genetics facility at
the Oak Ridge National Laboratory. Initially
established for genetic risk assessment and toxicology
studies, the Oak Ridge facility is one of the
world's largest. Mutant strains there express a
variety of inherited developmental and health
disorders, ranging from dwarfism and limb deformities
to sickle cell anemia, atherosclerosis, and
unusual susceptibilities to cancer.

Most of these existing mutant strains have
arisen from random alterations of genes, caused
by the same processes that occur naturally in all
living populations. However, other, more directed
means of gene alteration are also available.
So-called transgenic methods, which have been
developed and refined over the past 15 years,
allow DNA sequences engineered in the
laboratory to be introduced directly into the
genomes of mouse embryos. The embryos are
subsequently transferred to a foster mother, where
they develop into mice carrying specifically
designed alterations in a particular gene. The
differences in form, basic health, fertility, and
longevity produced by these "designer mutations"
then allow researchers to study the effects of
genetic defects that can mimic those found in
human patients. The payoff can be clues that aid
in the design of drugs and other treatments for the
human diseases.

The Human Genome Center at Berkeley is
using mice for similar purposes. In vivo libraries
of overlapping human genome fragments (each
100,000 to 1,000,000 base pairs long) are being
propagated in transgenic mice. The region of
chromosome 21 responsible for Down syndrome,
for example, is now almost fully represented in a
panel of transgenic mice. Such libraries have
several uses. For example, the precise biochemical
means by which identified genes produce their
effects can be studied in detail, and new genes
can be recognized by analyzing the effects of
particular genome fragments on the transgenic
animals. In such ways, the promise of the massive
effort to map and sequence the human genome
can be translated into the kind of biological
knowledge coveted by pharmaceutical designers
and medical researchers.

Adding to the potential value of mutant
mice as models for human genetic disease is growing
evidence of similarities between mouse and
human genes. Indeed, practically every human
gene appears to have a counterpart in the mouse
genome. Furthermore, the related mouse and
human genes often share very similar DNA
sequences and the same basic biological function.
If we imagine that the 23 pairs of human chromosomes
were shattered into smaller blocks (to yield
a total of, say, 150 pieces, ranging in size from
very small bits containing just a few genes to
whole chromosome arms), those pieces could be
reassembled to produce a serviceable model of the
mouse genome. This mouse genome jigsaw puzzle
is shown to the right. Thanks to this mouse-human
genomic homology, a newly located gene
on a human chromosome can often lead to a confident
prediction of where a closely related gene
will be found in the mouse, and vice versa.

Thus, a crippling heritable muscle disorder
in mice maps to a location on the mouse X chromosome
that is closely analogous to the map location
for the X-linked human Duchenne muscular
dystrophy gene (DMD). Indeed, we now know
that these two similar diseases are caused by the
mouse and human versions of the same gene.
Although mutations in the mouse mdx gene produce
a muscle disease that is less severe than the
heartbreaking, fatal disease resulting from the
DMD mutation in humans, the two genes produce
proteins that function in very similar ways and that
are clearly required for normal muscle development
and function in the corresponding species.
Likewise, the discovery of a mouse gene associated
with pigmentation, reproductive, and blood
cell defects was the crucial key to uncovering the
basis for a human disease known as the piebald
trait. Owing to such close human-mouse relationships
as these, together with the benefits of transgenic
technologies, the mouse offers enormous
potential in identifying new human genes,
deciphering their complex functions, and even treating
genetic diseases.

species, especially the mouse. Such comparative
sequencing has identified conserved
sequence  elements  that  might  act  as  regulatory 
regions  for  these  genes  and  has  also  assisted  in 
the  identification  of  gene  function  (see  "The 
Mighty  Mouse,"  pages  24-25). 

How good is good enough?

The  goal  of  most  sequencing  to  date 
has  been  to  guarantee  an  error  rate  below 
1  in  10,000,  sometimes  even  1  in  100,000. 
However,  the  difference  between  one  human 
being  and  another  is  more  like  one  base  pair  in 
five  hundred,  so  most  researchers  now  agree 
that  one  error  in  a  thousand  is  a  more  reason- 
able standard.  To  assure  a  higher  level  of  con- 
fidence, and  perhaps  to  uncover  important 
individual  differences,  the  most  biologically  or 
medically  important  regions  would  still  be 
sequenced  more  exhaustively,  but  using  this 
lowered  standard  would  greatly  reduce  the  cost 
of  acquiring  sequence  data  for  the  bulk  of 
human  DNA. 
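
The argument for a relaxed standard is easy to check with arithmetic. The short sketch below compares the expected number of sequencing errors at each standard with the expected number of natural person-to-person differences. The one-million-base-pair region is an arbitrary size chosen for illustration, and the function name is hypothetical.

```python
def expected_errors(bases, error_rate):
    """Expected count of events occurring at `error_rate` per base."""
    return bases * error_rate


region = 1_000_000  # a one-million-base-pair region, chosen for illustration
variation = expected_errors(region, 1 / 500)
standards = [("1 in 100,000", 1e-5), ("1 in 10,000", 1e-4), ("1 in 1,000", 1e-3)]
for label, rate in standards:
    errors = expected_errors(region, rate)
    print(f"{label}: about {errors:.0f} errors vs "
          f"about {variation:.0f} natural differences")
```

Even at one error in a thousand, sequencing errors remain fewer than the differences expected between two individuals over the same region, which is the comparison the text relies on.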

With  this  philosophy  in  mind,  Los 
Alamos  scientists  have  begun  a  project  to 
determine  the  cost  and  throughput  of  a  low- 
redundancy  sequencing  strategy  known  as 
sample sequencing (SASE, or "sassy"). Clones
are  selected  from  the  high-resolution  Los 
Alamos  cosmid  map,  then  physically  broken 
into  3000-base-pair  subclones  — much  as  in 
other  sequencing  approaches.  In  contrast  to, 
say,  shotgun  sequencing,  though,  only  a  small 
random  set  of  the  subclones  is  then  selected  for 
sequencing.  Sequence  fragments  already 
known  — end  sequences,  sequence-tagged  sites, 
and  so  forth— are  used  as  the  starting  points. 
The  result  is  sequence  coverage  for  about  70 
percent  of  the  original  cosmid  clone,  enough  to 
allow identification of genes and ESTs, thus
pinpointing the most critical targets for later,
more thorough sequencing efforts. Further,
the SASE-derived sequences provide enough
information  for  researchers  elsewhere  to  pur- 
sue just  such  comprehensive  efforts,  using 
whole  genomic  DNA.  In  addition,  the  cost  of 
SASE  sequencing  is  only  one-tenth  the  cost  of 
obtaining  a  complete  sequence,  and  a  genomic 
region  can  be  "sampled"  ten  times  as  fast. 
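
A rough feel for the trade-off between redundancy and coverage comes from the standard Poisson model of random sampling. This model is an assumption introduced here; the text does not say how the 70 percent figure was obtained. If reads land uniformly at random and their total length is c times the clone length, the expected fraction of the clone covered is 1 - exp(-c). The function names below are hypothetical.

```python
import math


def expected_coverage(redundancy):
    """Expected fraction of a clone covered at a given average redundancy."""
    return 1.0 - math.exp(-redundancy)


def redundancy_for(coverage):
    """Average redundancy needed to reach a target coverage fraction."""
    return -math.log(1.0 - coverage)


for c in (1.2, 3, 7, 10):
    print(f"{c:>4}x redundancy -> {expected_coverage(c):.4%} covered")
print(f"70% coverage needs about {redundancy_for(0.70):.2f}x redundancy")
```

Under this idealized model, 70 percent coverage corresponds to only slightly more than onefold redundancy, far below the seven- to tenfold redundancy the text cites for shotgun approaches. Real costs depend on much more than read count, so this is an order-of-magnitude illustration only.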


As  the  first  major  target  of  SASE  analy- 
sis, Los  Alamos  scientists  chose  a  cosmid 
contig  of  four  million  base  pairs  at  the  end 
(the  telomere)  of  the  short  arm  of  chromosome 
16.  By  early  1996,  over  1.4  million  base  pairs 
had  been  sequenced,  and  a  gene,  EST,  or  sus- 
pected coding  region  had  been  located  on 
every  cosmid  sampled. 

In  addition,  Los  Alamos  is  building  on 
the  SASE  effort  by  using  SASE  sequence 
data  as  the  basis  for  an  efficient  primer  walk- 
ing strategy  for  detailed  genomic  sequencing. 
The  first  application  of  this  strategy,  to  a 
telomeric  region  on  the  long  arm  of  chromo- 
some 7,  proved  to  be  as  efficient  as  typical 
shotgun  sequencing,  but  it  required  only 
two-  to  threefold  redundancy  to  produce 
a  complete  sequence,  in  contrast  to  the 
seven-  to  tenfold  redundancy  required  in 
shotgun  approaches.  The  resulting  230,000- 
base-pair  sequence  is  the  second-longest 
stretch  of  contiguous  human  DNA  sequence 
ever  produced. 


In  a  sense,  though,  even  a  complete  genome 
sequence— the  ultimate  physical  map  — is 
only  a  start  in  understanding  the  human 
genome.  The  deepest  mystery  is  how  the 
potential  of  100,000  genes  is  regulated  and 
controlled,  how  blood  cells  and  brain  cells  are 
able  to  perform  their  very  different  functions 
with  the  same  genetic  program,  and  how 
these  and  countless  other  cell  types  arise  in 
the first place from a single undifferentiated
egg  cell.  A  first  step  toward  solving  these 
subtle  mysteries,  though,  is  a  more  complete 
physical  picture  of  the  master  molecules  that 
lie  at  the  heart  of  it  all. 
INSTRUMENTATION AND INFORMATICS


FROM THE START, it has been clear
that  the  Human  Genome  Project 
would  require  advanced  instru- 
mentation and  automation  if  its 
mapping  and  sequencing  goals 
were  to  be  met.  And  here,  especially,  the 
DOE's engineering infrastructure and tradition
of instrumentation development have
been  crucial  contributors  to  the  international 
effort.  Significant  DOE  resources  have  been 
committed  to  innovations  in  instrumentation, 
ranging  from  straightforward  applications  of 
automation  to  improve  the  speed  and  effi- 
ciency of  conventional  laboratory  protocols 
(see, for example, Figure 9a) to the development
of technologies on the cutting edge,
technologies  that  might  potentially  increase 
mapping  and  sequencing  efficiencies  by 
orders  of  magnitude. 

On  the  first  of  these  fronts,  genome 
researchers  are  seeing  significant  improve- 
ments in  the  rate,  efficiency,  and  economy  of 
large-scale  mapping  and  sequencing  efforts 
as  a  result  of  improved  laboratory  automa- 
tion tools.  In  many  cases,  commercial  robots 
have  simply  been  mechanically  reconfigured 
and  reprogrammed  to  perform  repetitive 
tasks,  including  the  replication  of  large  clone 
libraries,  the  pooling  of  libraries  as  a  prelude 
to  various  assays,  and  the  arraying  of  clone 
libraries  for  hybridization  studies.  In  other 
cases,  custom-designed  instruments  have 
proved  more  efficient.  A  notable  illustra- 
tion is  the  world's  fastest  cell  and  chromo- 
some sorter,  developed  at  Livermore  and 
now  being  commercialized,  which  is  used 
to sort human chromosomes for chromosome-specific
libraries. Other examples
include a high-speed, robotics-compatible
thermal cycler developed at Berkeley,
which  greatly  accelerates  PCR  amplifica- 
tions, and  instruments  developed  at  Utah 
for  automated  hybridization  in  multiplex 
sequencing  schemes. 

Smaller   is   better  — and 
other  developments 

Beyond "mere" automation are efforts
aimed at more fundamental enhancements
of established techniques. In particular, a
number of DOE-supported efforts aim at
improved versions of the automated gel-based
Sanger sequencing technique. For example,
in place of the conventional slab gels,
ultrathin gels, less than 0.1 millimeter
thick, can be used to obtain
400 bases of sequence from each
lane in an hour's run, a fivefold
improvement in throughput
over conventional systems.
Even greater speedups are seen
when arrays of 0.1-millimeter
capillaries are used as the separation
medium. Both of these approaches
exploit higher electric field
strengths to increase DNA mobility
and to reduce analysis times. And Livermore
scientists are looking beyond even capillaries,
to sequencing arrays of rigid glass
microchannels, supplemented by automated
gel and sample loading.

The  capillary  approach  is  especially 
ripe  for  further  development.  Challenges 
include providing uniform excitation over
arrays of 50 to 100 capillaries and then
efficiently detecting the fluorescence emitted
by labeled samples. Technologies under
investigation include fiber-optic arrays,
scanning confocal microscopy, and cooled CCD
cameras. Some of this effort has already
been transferred to the private sector, and
tenfold improvements in speed, economy,
and efficiency are projected in future
commercial instruments.

Figure 9 (caption; the beginning is missing). ... different nucleotide
sequences (probes), which was then incubated with a cloned fragment
(the target) from the genome of the HIV-1 virus. If the fluorescently
labeled target contained a region complementary to a sequence in
the array, the target hybridized with the probe, the extent of the
hybridization depending on the extent of the match. This false-color
image depicts different levels of detected fluorescence from the bound
target fragments. Techniques such as this may ultimately be used in
sequencing applications, as well as in exploring genetic diversity,
probing for mutations, and detecting specific pathogens. Photo courtesy
of Affymetrix. (c) Sequencing based on the detection of fluorescence
from single molecules is being pursued at Los Alamos. The strand of
DNA to be sequenced is replicated using nucleotides linked to a
fluorescent tag, a different tag for each of the four nucleotides. The
tagged strand is then attached to a polystyrene bead suspended in a
flowing stream of water, and the nucleotides are enzymatically
detached, one at a time. Laser-excited fluorescence then yields the
nucleotide sequence, base by base. Much development remains to be done
on this technique, but success promises a cheaper, faster approach to
sequencing, one that might be applicable to intact cosmid clones
40,000 bases long.

The  move  toward  miniaturization  is 
afoot  elsewhere  as  well.  Building  on  experi- 
ences in  the  electronics  industry,  several 
DOE-supported  groups  are  exploring  ways 
to  adapt  high-resolution  photolithographic 
methods  to  the  manipulation  of  minuscule 
quantities  of  biological  reagents,  followed  by 
assays  performed  on  the  same  "chip." 
Current  thrusts  of  this  "nanotechnology" 
approach  include  the  design  of  microscopic 
electrophoresis  systems  and  ultrasmall-vol- 
ume,  high-speed  thermal  cycling  systems  for 
PCR.  A  miniaturized,  computer-controlled 
PCR  device  under  development  at  Livermore 
operates  on  9-volt  batteries  and  might  ulti- 
mately lead  to  arrays  of  thousands  of  individ- 
ually controlled  micro-PCR  chambers. 

Another  miniaturization  effort  aims  at 
the  fabrication  of  high-density  combinatorial 
arrays of custom oligomers (short chains of
nucleotides),  which  would  make  feasible 
large-scale  hybridization  assays,  including 
sequencing  by  hybridization.  This  innova- 
tive technique  uses  short  oligomers  that 
pair  up  with  corresponding  sequences  of 
DNA.  The  oligomers  are  placed  on  an  array 
by  a  process  similar  to  that  of  making  silicon 
chips  for  electronics.  Successful  matches 
between  oligomers  and  genomic  DNA  are 
then  detected  by  fluorescence,  and  the  appli- 
cation of  sophisticated  statistical  analyses 
reassembles  the  target  sequence.  This  same 
technology  has  already  been  used  for  genetic 
screening  and  cDNA  fingerprinting.  Figure 
9b  illustrates  a  DOE-supported  application 
of high-density oligonucleotide arrays to the
detection  of  mutations  in  the  HIV-1  genome. 
Similar  approaches  can  be  envisioned  to 
understand  differences  in  patterns  of  gene 
expression: Which genes are active (which
are producing mRNA) in which cells? Which
are  active  at  different  times  during  an  organ- 
ism's development?  Which  are  active,  or  inac- 
tive, in  disease? 
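
The reassembly step in sequencing by hybridization, described above, can be illustrated with a toy reconstruction. The sketch assumes an idealized experiment: every oligomer of length k present in the target is detected, there are no false signals, and no overlap of k-1 bases occurs twice in the target. Real data would require the statistical treatment the text mentions; the function names and the example target are hypothetical.

```python
def reconstruct_from_spectrum(kmers):
    """Rebuild a sequence from its set of overlapping k-mers.

    Toy version: assumes every k-mer of the target was detected exactly,
    with no errors, and that no (k-1)-base overlap occurs twice.
    """
    kmers = set(kmers)
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the one whose prefix is not any k-mer's suffix.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1 or len(by_prefix) != len(kmers):
        raise ValueError("spectrum is ambiguous for this toy method")
    sequence = starts[0]
    current = starts[0]
    for _ in range(len(kmers) - 1):
        # Follow the unique k-mer that overlaps the current one by k-1 bases.
        current = by_prefix[current[1:]]
        sequence += current[-1]
    return sequence


def spectrum(sequence, k):
    """All k-mers of a sequence: what an ideal array would light up for."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


target = "ATGGCGTGCA"
probes_hit = spectrum(target, 4)
rebuilt = reconstruct_from_spectrum(probes_hit)
print(rebuilt)
```

Repeated overlaps make the reconstruction ambiguous, which the sketch reports as an error rather than guessing. Longer oligomers reduce that ambiguity at the cost of a much larger array.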

Sequencing  by  hybridization  is  only  one 
of  several  forward-looking  ideas  for  revolu- 
tionizing sequencing  technology.  In  spite  of 
continuing  improvements  to  sequencers  based 
on  the  classic  methods,  it  is 
nonetheless  desirable  to  explore 
altogether  new  approaches,  with 
an  eye  to  simplifying  sample 
preparation,  reducing  measure- 
ment times,  increasing  the  length 
of  the  strands  that  can  be  analyzed 
in  a  single  run,  and  facilitating 
interpretation  of  the  results.  Over 
the  course  of  the  past  few  years, 
several  alternative  approaches  to 
direct  sequencing  have  been 
explored,  including  atomic-resolu- 
tion molecular  scanning,  single- 
molecule  detection  of  individual 
bases,  and  mass  spectrometry  of 
DNA  fragments. 

All  of  these  alternatives  look  promising 
in  the  long  term,  but  mass  spectrometry  has 
perhaps  demonstrated  the  greatest  near-term 
potential.  Mass  spectrometry  measures  the 
masses  of  ionized  DNA  fragments  by  record- 
ing their  time-of-flight  in  vacuum.  It  would 
therefore  replace  traditional  gel  electrophore- 
sis as  the  last  step  in  a  conventional  sequenc- 
ing scheme.  Routine  application  of  this  tech- 
nique still  lies  in  the  future,  but  fragments  of 
up  to  500  bases  have  been  analyzed,  and  prac- 
tical systems  based  on  high-resolution  mass 
separations  of  DNA  fragments  of  fewer  than 
100  bases  are  currently  being  developed  at 
several  universities  and  national  laboratories. 
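
The time-of-flight principle follows from energy conservation: an ion of charge q accelerated through a potential V gains kinetic energy qV, so its flight time over a field-free tube grows with the square root of its mass. The sketch below works through the numbers. The tube length, accelerating voltage, single charge, and the figure of roughly 330 daltons per nucleotide are illustrative assumptions, not parameters of any instrument described in the text.

```python
import math

DALTON_KG = 1.66053907e-27           # kilograms per atomic mass unit
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs


def flight_time(n_bases, tube_length_m=1.0, accel_volts=20_000.0,
                charge=1, mass_per_base_da=330.0):
    """Field-free flight time (seconds) of a charged DNA fragment."""
    mass_kg = n_bases * mass_per_base_da * DALTON_KG
    # Energy conservation: q*V = (1/2) m v^2, so v = sqrt(2 q V / m).
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * accel_volts / mass_kg)
    return tube_length_m / velocity


for n in (99, 100, 499, 500):
    print(f"{n:>3} bases: {flight_time(n) * 1e6:.2f} microseconds")
```

Because flight time scales with the square root of mass, the relative time difference between fragments of n and n+1 bases shrinks roughly as 1/(2n). That is one plausible reason why resolving single-base differences is easier for fragments of fewer than 100 bases than for fragments of 500.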

Another  innovative  sequencing  method 
is  under  investigation  at  Los  Alamos.  As 
depicted  in  Figure  9c,  each  of  the  four  bases 
(A,  T,  C,  G)  in  a  single  strand  of  DNA 
receives  a  different  fluorescent  label,  then  the 
bases  are  enzymatically  detached,  one  at  a 
time.  The  characteristic  fluorescence  is 
detected  by  a  laser  system,  thereby  yielding 
the  sequence,  base  by  base.  This  approach  is 
beset  by  major  technical  challenges,  and  direct 
sequencing has not yet been achieved. But
the  potential  benefits  are  great,  and  much  of 
the  instrumentation  for  sensitive  detection  of 
fluorescence  signals  has  already  proved 
useful  for  molecular  sizing  in  mapping 
applications. 

Dealing with the data

Among  the  less  visible  challenges  of  the 
Human  Genome  Project  is  the  daunting 
prospect  of  coping  with  all  the  data  that  suc- 
cess implies.  Appropriate  information  sys- 
tems are  needed  not  only  during  data  acqui- 
sition, but  also  for  sophisticated  data  analysis 
and  for  the  management  and  public  distribu- 
tion of  unprecedented  quantities  of  biological 
information.  Further,  because  much  of  the 
challenge  is  interpreting  genomic  data  and 
making  the  results  available  for  scientific  and 
technological  applications,  the  challenge 
extends not just to the Human Genome
Project, but also to the microbial genome
program  and  to  public-  and  private-sector 
programs  focused  on  areas  such  as  health 
effects,  structural  biology,  and  environmental 
remediation.  Efforts  in  all  these  areas  are  the 
mandate  of  the  DOE  genome  informatics 
program,  whose  products  are  already  widely 
used  in  genome  laboratories,  general  molecu- 
lar biology  and  medical  laboratories,  biotech- 
nology companies,  and  biopharmaceutical 
companies  around  the  world. 

The  roles  of  laboratory  data  acquisition 
and  management  systems  include  the  con- 
struction of  genetic  and  physical  maps,  DNA 
sequencing,  and  gene  expression  analysis. 
These  systems  typically  comprise  databases 
for  tracking  biological  materials  and  experi- 
mental procedures,  software  for  controlling 
robots  or  other  automated  systems,  and 
software  for  acquiring  laboratory  data  and 
presenting  it  in  useful  form.  Among  such 
systems are physical mapping databases
developed at Livermore and Los Alamos,
robot  control  software  developed  at  Berkeley 
and  Livermore,  and  DNA  sequence  assembly 
software  developed  at  the  University  of 
Arizona.  These  systems  are  the  keys  to  effi- 
cient, cost-effective  data  production  in  both 
DOE  laboratories  and  the  many  other  labo- 
ratories that  use  them. 

The  interpretation  of  map  and  sequence 
data  is  the  job  of  data  analysis  systems. 
These  systems  typically  include  task-specific 
computational  engines,  together  with  graph- 
ics and  user-friendly  interfaces  that  invite 
their  use  by  biologists  and  other  non-com- 
puter scientists.  The  genome  informatics 
program  is  the  world  leader  in  developing 
automated  systems  for  identifying  genes 
in  DNA  sequence  data  from  humans  and 
other  organisms,  supporting  efforts  at  Oak 
Ridge  National  Laboratory  and  elsewhere. 
The  Oak  Ridge-developed  GRAIL  system, 
illustrated  in  Figure  10,  is  a  world-standard 
gene  identification  tool.  In  1995  alone,  more 
than  180  million  base  pairs  of  DNA  were 
analyzed  with  GRAIL. 

A  third  area  of  informatics  reflects,  in  a 
sense,  the  ultimate  product  of  the  Human 
Genome  Project— information  readily  avail- 
able to  the  scientific  and  lay  communities. 


Public  resource  databases  must  provide  data 
and  interpretive  analyses  to  a  worldwide 
research  and  development  community.  As 
this  community  of  researchers  expands  and 
as  the  quantity  of  data  grows,  the  chal- 
lenges of  maintaining  accessible  and  useful 
databases  likewise  increase.  For  example,  it 
is  critical  to  develop  scientific  databases  that 
"interoperate,"  sharing  data  and  protocols  so 
that  users  can  expect  answers  to  complex 
questions  that  demand  information  from  geo- 
graphically distributed  data  resources.  As 
the  genome  project  continues  to  provide  data 
that  interlink  structural  and  functional  bio- 
chemistry, molecular,  cellular,  and  develop- 
mental biology,  physiology  and  medicine,  and 
environmental  science,  such  interoperable 
databases  will  be  the  critical  resources 
for  both  research  and  technology  develop- 
ment. The  DOE  genome  informatics  pro- 
gram is  crucial  to  the  multiagency  effort  to 
develop  just  such  databases.  Systems  now 
in  place  include  the  Genome  Database  of 
human  genome  map  data  at  Johns  Hopkins 
University,  the  Genome  Sequence  DataBase 
at  the  National  Center  for  Genome 
Resources  in  Santa  Fe,  and  the  Molecular 
Structure  Database  at  Brookhaven  National 
Laboratory.

THE HUMAN GENOME PROJECT is
rich  with  promise,  but  also 
fraught  with  social  implications. 
We  expect  to  learn  the  under- 
lying causes  of  thousands  of 
genetic  diseases,  including  sickle  cell  anemia, 
Tay-Sachs  disease,  Huntington  disease, 
myotonic  dystrophy,  cystic  fibrosis,  and 
many  forms  of  cancer— and  thus  to  predict 
the  likelihood  of  their  occurrence  in  any  indi- 
vidual. Likewise,  genetic  information  might 
be  used  to  predict  sensitivities  to  various 
industrial  or  environmental  agents.  The  dan- 
gers of  misuse  and  the  potential  threats  to 
personal  privacy  are  not  to 
be  taken  lightly. 

In  recognition  of  these 
important  issues,  both  the 
DOE  and  the  National 
Institutes  of  Health  devote  a 
portion  of  their  resources  to 
studies  of  the  ethical,  legal, 
and  social  implications 
(ELSI)  of  human  genome 
research.  Perhaps  the  most 
critical  of  social  issues  are 
the  questions  of  privacy  and 
fair  use  of  genetic  informa- 
tion. Most  observers  agree 
that  personal  knowledge  of 
genetic  susceptibility  can  be 
expected  to  serve  us  well,  opening  the  door  to 
more  accurate  diagnoses,  preventive  inter- 
vention, intensified  screening,  lifestyle 
changes,  and  early  and  effective  treatment. 
But  such  knowledge  has  another  side,  too: 
the  risk  of  anxiety,  unwelcome  changes  in 
personal relationships, and the danger of
stigmatization. Consider, for example, the
impact  of  information  that  is  likely  to  be 
incomplete  and  indeterminate  (say,  an  indica- 
tion of  a  25  percent  increase  in  the  risk  of 
cancer).  And  further,  if  handled  carelessly, 
genetic  information  could  threaten  us  with 
discrimination  by  potential  employers  and 
insurers.  Other  issues  are  perhaps  less 
immediate  than  these  personal  concerns,  but 
they  are  no  less  challenging.  How,  for  exam- 
ple, are  the  "products"  of  the  Human 
Genome  Project  to  be  patented  and  commer- 
cialized? How  are  the  judicial,  medical, 
and  educational  communities  —  not  to  men- 
tion the  public  at  large  — to  be  effectively 
educated  about  genetic  research  and  its 
implications? 

To confront all these issues, the NIH-DOE Joint Working Group on Ethical, Legal, and Social
Implications of Human Genome Research was created in 1990 to coordinate ELSI policy and
research between the two agencies. One focus of DOE activity has been to foster educational
programs aimed both at private citizens and at policy-makers and educators. Fruits of these
efforts include radio and television documentaries, high school curricula and other
educational material, and science museum displays. In addition, the DOE has concentrated on
issues associated with privacy and the confidentiality of genetic information, on workplace
and commercialization issues (especially screening for susceptibilities to environmental or
workplace agents), and on the implications of research findings regarding the interactions
among multiple genes and environmental influences.

Whereas the issues raised by modern genome research are among the most challenging we face,
they are not unprecedented. Issues of privacy, knotty questions of how knowledge is to be
commercialized, problems of dealing with probabilistic risks, and the imperatives of
education have all been confronted before. As usual, defensible perspectives and reasonable
arguments, even precious rights, exist on opposing sides of every issue. It is a balance that
must be sought. Accordingly, further study is needed, as well as continuing efforts to
promote public awareness and understanding, as we strive to define policies for the
intelligent use of the profound knowledge we seek about ourselves.


THE AGE OF DISCOVERY was the age of da Gama, Columbus, and Magellan, an era when European
civilization reached out to the Far East and thus filled many of the voids in its map of the
world. But in a larger sense, we have never ceased from our exploration and discovery.
Science has been unstinting over the ages in its efforts to complete our intellectual picture
of the universe. In this century, our explorations have extended from the subatomic to the
cosmic, as we have mapped the heavens to their farthest reaches and charted the properties of
the most fleeting elementary particles. Nor have we neglected to look inward, seeking, as it
were, to define the topography of the human body. Beginning with the first modern anatomical
studies in the sixteenth century, we have added dramatically to our picture of human anatomy,
physiology, and biochemistry. The Human Genome Project is thus the next stage in an epic
voyage of discovery, a voyage that will bring us to a profound understanding of human
biology.

In an important way, though, the genome project is very different from many of our
exploratory adventures. It is spurred by a conviction of practical value, a certainty that
human benefits will follow in the wake of success. The product of the Human Genome Project
will be an enormously rich biological database, the key to tracking down every human gene,
and thus to unveiling, and eventually to subverting, the causes of thousands of human
diseases. The sequence of our genome will ultimately allow us to unlock the secrets of life's
processes, the biochemical underpinnings of our senses and our memory, our development and
our aging, our similarities and our differences.

It has further been said that the Human Genome Project is guaranteed to succeed: Its goal is
nothing more assuming than a sequence of three billion characters. And we have a very good
idea of how to read those characters. Unlike perilous voyages or searches for unknown
subatomic particles, this venture is assured of its goal. But beyond a detailed picture of
human DNA, no one can predict the form success will take. The genome project itself offers no
promises of cancer cures or quick fixes for Alzheimer's disease, no detailed understanding of
genius or schizophrenia. But if we are ever to uncover the mysteries of carcinogenesis, if we
are ever to know how biochemistry contributes to mental illness and dementia, if we ever hope
to really understand the processes of growth and development, we must first have a detailed
map of the genetic landscape. That's what the Human Genome Project promises. In a way, it's a
rather prosaic step, but what lies beyond is breathtaking.

The World Wide Web offers the easiest path to current news about the Human Genome Project.
Good places to start include the following:

•  DOE Human Genome Program: http://www.er.doe.gov/production/oher/hug_top.html

•  NIH National Center for Human Genome Research: http://www.nchgr.nih.gov

•  Human Genome Management Information System at Oak Ridge National Laboratory:
http://www.ornl.gov/TechResources/Human_Genome/home.html

•  Lawrence Berkeley National Laboratory Human Genome Center:
http://www-hgc.lbl.gov/GenomeHome.html

•  Lawrence Livermore National Laboratory Human Genome Center:
http://www-bio.llnl.gov/bbrp/genome/genome.html

•  Los Alamos National Laboratory Center for Human Genome Studies:
http://www-ls.lanl.gov/LSwelcome.html

•  The Genome Database at Johns Hopkins University School of Medicine:
http://gdbwww.gdb.org/

•  The National Center for Genome Resources: http://www.ncgr.org/


The U.S. Human Genome Project is part of an international effort to develop genetic and
physical maps and determine the DNA sequence of the human genome and the genomes of
several model organisms. Thanks to advances in technology and a tightly focused effort, the
project is on track with respect to its initial 5-year goals. Because 3 years have elapsed since
these goals were set, and because a much more sophisticated and detailed understanding of
what needs to be done and how to do it is now available, the goals have been refined and
extended to cover the first 8 years (through September, 1998) of the 15-year genome
initiative.

In 1990, the Human Genome Programs of the National Institutes of Health (NIH) and the
Department of Energy (DOE) developed a joint research plan with specific goals for the first
5 years (FY 1991 - 1995) of the U.S. genome project. It has served as a valuable guide for
both the research community and the agencies' administrative staff in developing and
executing the genome project and assessing its progress for the past 3 years. Great strides
have been made toward the achievement of the initial set of goals, particularly with respect
to constructing detailed human genetic maps, improving physical maps of the human genome
and the genomes of certain model organisms, developing improved technology for DNA
sequencing and information handling, and defining the most urgent set of ethical, legal and
social issues associated with the acquisition and use of large amounts of genetic information.

Progress toward achieving the first set of goals for the genome project appears to be on
schedule, or in some instances, even ahead of schedule. Furthermore, technological
improvements that could not have been anticipated in 1990 have in some areas changed the
shape of the project and allowed more ambitious approaches. Earlier this year, it was
therefore decided to update and extend the initial goals to address the scope of genome
research beyond the completion of the original 5-year plan. A major purpose of revisiting the
plan is to inform and provide a new guide to all participants in the genome project about the
project's goals. To obtain the advice needed to develop the extended goals, NIH and DOE
held a series of meetings with a large number of scientists and other interested scholars and
representatives of the public, including many who previously had not been direct participants
in the genome project. Reports of all these meetings are available from the Office of
Communications of the National Human Genome Research Institute, and the Human
Genome Management Information System of the DOE (2,3). Finally, a group of
representative advisors from NIH and DOE drafted a set of new, extended goals for
presentation to the National Advisory Council for Human Genome Research of the NIH and
the Health and Environmental Research Advisory Committee of the DOE. These bodies have
approved this document as a statement of their advice to the two agencies, and the following
represents the goals for FY 1994 - 1998 (i.e. October 1, 1993 - September 30, 1998).

General  Principles 

Several general observations underlie the specific goals described here. The first observation
is that successful development of new technology for genomic and genetic research has been
essential to the achievements of the program to date and will continue to be critical in the
future. It was clearly recognized, both in the 1988 NRC report (4) and in the first NIH-DOE
plan, that attainment of the ambitious goals originally set for the genome project would
require significant technological advances in all areas such as mapping, sequencing,
informatics, and gene identification. As the genome project has proceeded, progress along a
broad range of technological fronts has been conspicuous. Among the most notable of these
developments have been (i) new types of genetic markers, such as microsatellites, that can be
assayed by the polymerase chain reaction (PCR); (ii) improved vector systems for cloning
large DNA fragments and better experimental strategies and computational methods for
assembling those clones into large, overlapping sets (contigs) that compose useful physical
maps; (iii) the definition of the sequence tagged site (STS) (5) as a common unit of physical
mapping; and (iv) improved technology and automation for DNA sequencing. Further
substantial improvements in technology are needed in all areas of genome research,
especially in DNA sequencing, if the project is to stay on schedule and meet the demanding
goals that are being set.

A second general observation concerns an evolution in the levels of biological organization
at which genomic research will likely function over the next few years. Initially, attention was
focused at the chromosome as the basic unit of genome analysis. Large-scale mapping
efforts, in particular, were directed at construction of chromosome maps. The sophisticated
genetic linkage maps now available and the detailed physical maps that are being produced
are clear measures of the success of that approach. However, other units of study for the
human genome project will also have increasing usefulness in the future. Therefore, further
mapping efforts directed at both larger and smaller targets should be encouraged. At one end
of the scale, "whole genome" mapping efforts, in which the entire genome is efficiently
analyzed, have become feasible with developments in PCR application and robotics. These
approaches generally produce relatively low resolution maps with current technology. At the
other end of the scale, increasing attention needs to be paid to detailed mapping, sequencing
and annotation of regions on the order of one to a few megabases in size. Although small in
comparison to the whole genome, a megabase is still large in comparison to the capabilities
of conventional molecular genetic analysis. Thus, development of efficient technology for
approaching detailed analysis of several megabase sections of the genome will provide a
useful bridge between conventional genetics and genomics, as well as a foundation for
innovation from which future methods for analysis of larger regions may arise.

Third, a goal for identifying genes within maps and sequences, which was implicit in the
original plan, has now been made explicit. The progress already made on the original goals,
combined with promising new approaches to gene identification, allows this element of
genome analysis to be given greater visibility. This increased emphasis on gene identification
will greatly enrich the maps that are produced.

It must also be noted here that, as in the original five-year plan, these goals again assume a
funding level for the U.S. genome program of $200 million annually, adjusted for inflation.
As the detailed cost analysis for the first five-year plan was performed in 1991, a cost of
living increase must be added for all years beyond FY 1991. This funding level has not yet
been achieved (see Table 1).


Table 1: Budget of the Human Genome Project for the NIH and the DOE (millions of
dollars). (Note: Budgets for 1994 and 1995 have not yet been determined.)

Fiscal Year      NIH      DOE      Total      1991 Projection of Needs
1991             87.4     47.4     134.8      135.1
1992            104.8     61.4     166.2      169.2
1993            106.1     64.5     170.6      218.9
1994               -        -        -        246.8
1995               -        -        -        259.9
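
The gap between actual funding and the 1991 projection of needs can be read off Table 1 with
a little arithmetic. The Python sketch below is illustrative only: the variable and function
names are ours, and the figures are copied from the table (millions of dollars).

```python
# Figures from Table 1 (millions of dollars); names are illustrative.
budget = {
    1991: {"nih": 87.4, "doe": 47.4, "projected": 135.1},
    1992: {"nih": 104.8, "doe": 61.4, "projected": 169.2},
    1993: {"nih": 106.1, "doe": 64.5, "projected": 218.9},
}

def shortfall(year):
    """Projected need minus actual NIH + DOE funding, rounded to 0.1."""
    row = budget[year]
    return round(row["projected"] - (row["nih"] + row["doe"]), 1)

for year in sorted(budget):
    print(year, shortfall(year))
```

On these figures the shortfall grows from about $0.3 million in FY 1991 to about $48 million
in FY 1993.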

International  Aspects 

The  Human  Genome  Project  is  truly  international  in  scope,  as  the  original  planners 
envisioned  it.  Its  success  to  date  has  been  possible  because  of  major  contributions  from  many 
countries  and  the  extensive  sharing  of  information  and  resources.  It  is  hoped  and  anticipated 
that  this  spirit  of  international  cooperation  and  sharing  will  continue.  This  coordination  has 
been  achieved  largely  by  scientist  to  scientist  interaction,  facilitated  by  the  Human  Genome 
Organization  (HUGO),  which  has  taken  on  responsibility  for  some  aspects  of  the 
management of the international chromosome workshops in particular. These workshops
have  served  to  encourage  collaboration  and  the  sharing  of  information  and  resources  and  to 
facilitate  the  expeditious  completion  of  chromosome  maps. 

Several  notable  individual  international  collaborations  have  marked  the  genome  project  so 
far.  One  is  the  United  States  -  United  Kingdom  collaboration  on  the  sequencing  of  the 
Caenorhabditis  elegans  genome.  Scientists  at  the  Los  Alamos  National  Laboratory  are 
collaborating with Australian colleagues to develop a physical map of chromosome 16, and
investigators at the Lawrence Livermore National Laboratory with Japanese scientists on a
high resolution physical map of chromosome 21. Other joint efforts include the collaboration
between the NIH and the Centre d'Etude du Polymorphisme Humain (CEPH) on the genetic
map of the human genome and the Whitehead/Massachusetts Institute of
Technology-Genethon collaboration on the whole genome approach to the human physical
map. These are but examples of the myriad interrelationships that have formed, generally
spontaneously,  among  participating  scientists. 

Specific  Goals 

Genetic  Map 

The 2-5 cM human genetic map of highly informative markers called for in the original goals
is expected to be completed on time. However, improvements to make the map more useful
and accessible will still be needed. If the field develops as predicted, there will be an
increasing demand for technology that allows the nonexpert to type families rapidly for
medical research purposes. In addition, to study complex genetic diseases, there is a need to
be able to easily test large numbers of individuals for many markers simultaneously. In the
long run, polymorphic markers that can be screened in a more automated fashion and
methods of gene mapping that obviate the need for a standard set of polymorphic markers
are also desirable.

Goals 


•  Complete the 2-5 cM map by 1995

•  Develop  technology  for  rapid  genotyping 

•  Develop  markers  that  are  easier  to  use 

•  Develop  new  mapping  technologies 

Physical  Map 

An  STS-based  physical  map  of  the  human  genome  is  expected  to  be  available  in  the  next  2-3 
years,  with  some  areas  mapped  in  more  detail  than  others  and  an  average  interval  between 
markers of approximately 300 kilobases. However, such a map will not likely be sufficiently
detailed  to  provide  a  substrate  for  sequencing  or  to  be  optimally  useful  to  investigators 
searching  for  disease  genes.  The  original  goal  of  a  physical  map  with  STS  markers  at 
intervals  of  100  kb  remains  realistic  and  useful  and  would  serve  both  sequencers  and 
mappers.  Using  widely  available  methods,  a  molecular  biologist  can  isolate  a  gene  that  is 
within  100  kb  of  a  mapped  marker,  and  a  sequencer  can  use  such  a  map  as  the  basis  for 
preparing  the  DNA  for  sequencing.  To  the  extent  that  they  do  not  introduce  statistical  bias, 
the  use  of  STS's  with  added  value  (such  as  those  derived  from  polymorphic  markers  or 
genes)  is  encouraged,  because  such  markers  add  to  the  usefulness  of  the  map. 

Goal 

•  Complete  an  STS  map  of  the  human  genome  at  a  resolution  of  100  kb. 
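
As a rough illustration of what these resolutions imply, the number of evenly spaced STS
markers is about the genome length divided by the marker interval. The Python sketch below
assumes a genome of three billion base pairs and perfectly even spacing, which real maps do
not achieve; the function name is hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumed round figure: three billion base pairs

def markers_needed(interval_kb):
    """Approximate count of evenly spaced markers at the given interval."""
    return GENOME_BP // (interval_kb * 1000)

print(markers_needed(300))  # average interval expected in the next 2-3 years
print(markers_needed(100))  # target resolution of the goal
```

Under these assumptions, moving from a 300 kb to a 100 kb interval triples the number of
markers that must be placed.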

Physical  maps  of  greater  than  100  kb  resolution  are  needed  for  DNA  sequencing,  for  the 
purpose  of  finding  genes  and  for  other  biological  purposes.  While  a  variety  of  options  are 
being  explored  for  creating  such  maps,  the  optimal  approach  is  by  no  means  clear.  There  is  a 
need  to  develop  new  strategies  for  high  resolution  physical  mapping  as  well  as  new  cloning 
systems  that  are  well  integrated  with  advanced  sequencing  technology.  Technology  for 
sequencing  is  evolving  rapidly.  Therefore,  preparation  of  sequence-ready  sets  of  clones 
should be closely associated with an imminent intent to sequence.

There  is  a  pressing  need  for  clone  libraries  with  improved  stability  and  lower  chimerism  and 
other  artifacts  and  a  need  for  better  technology  for  traveling  from  one  STS  to  the  next.  A 
greater accessibility to clone libraries should also be encouraged.

DNA  Sequencing 

Although the goal of sequencing DNA at a cost of $0.50 per base pair may be met by 1996
as originally projected, the rate at which DNA can be sequenced will not be sufficient for
sequencing the whole genome. Priority should be given during the next five years to
increasing sequencing capacity by increasing the number of groups oriented toward
large-scale production sequencing. Substantial new technology that will allow sequencing at
higher rates and lower costs is also needed: both evolutionary technology developed from
improvements in current gel-based approaches and revolutionary technology based on new
principles. These developments will only occur if significantly greater financial resources can
be invested in this area. It is estimated that an immediate investment of $100 million per year
will be needed for sequencing technology alone, to allow the human genome to be sequenced
by the year 2005.

Goals 

•  Develop  efficient  approaches  to  sequencing  one-  to  several-  megabase  regions  of  DNA 
of  high  biological  interest. 

•  Develop  technology  for  high  throughput  sequencing,  focusing  on  systems  integration 
of  all  steps  from  template  preparation  to  data  analysis. 

•  Build  up  sequencing  capacity  to  a  collective  rate  of  50  Mb  per  year  by  the  end  of  the 
period.  This  rate  should  result  in  an  aggregate  of  80  Mb  of  DNA  sequence  completed 
by  the  end  of  FY  1998. 
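
To see why the plan treats capacity as the bottleneck, the stated targets can be set against
the size of the genome. The Python sketch below assumes a three-billion-base-pair genome, a
constant sequencing rate, and a flat price per base pair; all three are simplifications, and
the names are illustrative.

```python
GENOME_MB = 3000       # assumed genome size: three billion base pairs
TARGET_RATE_MB = 50    # collective rate targeted for the end of FY 1998
COST_PER_BP = 0.50     # dollars per base pair, the cost goal for 1996

def years_at_rate(genome_mb, rate_mb_per_year):
    """Years needed to sequence the genome once at a constant rate."""
    return genome_mb / rate_mb_per_year

def total_cost_dollars(genome_mb, cost_per_bp):
    """Cost of sequencing every base pair once at a flat price."""
    return genome_mb * 1_000_000 * cost_per_bp

print(years_at_rate(GENOME_MB, TARGET_RATE_MB))
print(total_cost_dollars(GENOME_MB, COST_PER_BP))
```

At a constant 50 Mb per year, a single pass over the genome would take about 60 years, and
at $0.50 per base pair it would cost about $1.5 billion, which suggests why the plan asks for
rates well beyond the FY 1998 target if sequencing is to finish by 2005.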

The  standard  model  organisms  should  be  sequenced  as  rapidly  as  possible,  with  Escherichia 
coli  and  Saccharomyces  cerevisiae  completed  by  1998  or  earlier  and  C.  elegans  nearing 
completion  by  1998.  It  is  often  advantageous  to  sequence  the  corresponding  regions  of 
human  and  mouse  DNA  side-by-side  in  areas  of  high  biological  interest.  The  sequencing  of 
full-length,  mapped  complementary  DNA  (cDNA)  molecules  is  useful,  especially  if  it  is 
associated  with  technological  innovation  extensible  to  genomic  sequencing. 

The  measurement  of  the  cost  of  sequencing  is  complex  and  fraught  with  many  uncertainties 
due to the diversity of approaches being used. However, we need to continue to reduce
costs,  as  well  as  improve  our  ability  to  assess  the  accuracy  of  the  sequence  produced.  This 
latter  point  must  be  addressed  in  future  sequencing  efforts.  Cost  will  be  highly  dependent  on 
the  level  of  accuracy  achieved. 

Gene  Identification 

Identification  of  all  the  genes  in  the  human  genome  and  in  the  genomes  of  certain  model 
organisms  is  an  implicit  part  of  the  Human  Genome  Project.  Although  the  previous  5-year 
plan  did  not  explicitly  identify  this  activity  with  a  specific  goal,  progress  in  mapping  and  in 
technology now makes it desirable to do so. With both genetic and physical maps of the
human  genome  and  the  genomes  of  certain  model  organisms  becoming  available  and  large 
amounts  of  sequence  data  beginning  to  appear,  it  is  important  to  develop  better  methods  for 
identifying  all  the  genes  and  incorporating  all  known  genes  onto  the  physical  maps  and  the 
DNA  sequences  that  are  produced.  This  information  will  make  the  maps  most  useful  to 
scientists  studying  the  role  of  genes  in  health  and  disease.  While  many  promising  approaches 
are  being  explored,  more  development  is  needed  in  this  area. 

Goal 

•  Develop  efficient  methods  of  identifying  genes  and  for  placement  of  known  genes  on 
physical  maps  or  sequenced  DNA. 

Technology Development

Development of new and improved technology is vital to the genome project. Certain
technologies,  such  as  automation  and  robotics,  cut  across  many  areas  of  genome  research 
and  need  particular  attention.  Cooperation  in  technology  development  should  be  encouraged 
where  possible,  because  it  is  likely  to  be  more  effective  and  efficient  than  competition  and 
duplication.  The  technology  developed  must  be  expandable  and  exportable,  the  long  term 
goal  being  to  create  technology  that  will  be  available  in  many  basic  science  laboratories  and 
allow  the  efficient  sequencing  of  other  genomes.  Technology  development  is  costly  and  has 
not  been  sufficiently  funded. 

Goal 

•  Substantially  expand  support  of  innovative  technological  developments  as  well  as 
improvements  in  current  technology  for  DNA  sequencing  and  to  meet  the  needs  of  the 
Human  Genome  Project  as  a  whole. 

Model  Organisms 

Excellent  progress  has  been  made  on  the  mouse  genetic  map,  the  Drosophila  physical  map, 
as well as the sequencing of the DNA of E. coli, S. cerevisiae and C. elegans. Many of the
original  goals  for  this  area  are  likely  to  be  exceeded.  Completion  of  the  mouse  map  and 
sequencing  of  all  the  selected  model  organism  genomes  continue  to  be  high  priorities.  The 
current  emphasis  for  sequencing  of  mouse  DNA  should  be  placed  on  sequencing  of  selected 
regions  of  high  biologic  interest  side-by-side  with  the  corresponding  human  DNA. 

Goals 

•  Finish an STS map of the mouse at 300 kb resolution

•  Finish  the  sequence  of  the  E.  coli  and  S.  cerevisiae  genomes  by  1998  or  earlier 

•  Continue  sequencing  C.  elegans  and  Drosophila  genomes,  with  the  aim  of  bringing  C. 
elegans  to  near  completion  by  1998 

•  Sequence  selected  segments  of  mouse  DNA  side  by  side  with  corresponding  human 
DNA  in  areas  of  high  biological  interest 

Informatics 

In order to collect, organize and interpret the large amounts of complex mapping and
sequencing data produced by the Human Genome Project, appropriate algorithms, software,
database tools and operational infrastructure are required. The success of the genome project
will depend, in large part, on the ease with which biologists can gain access to and use the
information produced. Although considerable progress has been made in this area since the
beginning of the genome project, there is a continuing need for improvements to stay current
with evolving requirements. As the amount of information increases, the demand for it and
the need for convenient access increase also. Thus, data management, data analysis and data
distribution remain major goals for the future.

Goals

•  Continue to create, develop and operate databases and database tools for easy access
to data, including effective tools and standards for data exchange and links among
databases

•  Consolidate,  distribute  and  continue  to  develop  effective  software  for  large-scale 
genome  projects 

•  Continue  to  develop  tools  for  comparing  and  interpreting  genome  information 

Ethical,  Legal  and  Social  Implications  (ELSI) 

The  ELSI  components  of  the  Human  Genome  programs  of  NIH  and  DOE  are  strongly 
connected  with  genomic  research,  so  that  policy  discussions  and  the  recommendations 
developed  are  couched  in  the  reality  of  the  science.  To  date,  the  focus  of  the  ELSI  programs 
has  been  on  the  most  immediate  potential  applications  in  society  of  genome  research.  Four 
areas  were  identified  by  advisors  to  the  ELSI  program  for  initial  emphasis:  privacy  of  genetic 
information,  safe  and  effective  introduction  of  genetic  information  in  the  clinical  setting, 
fairness  in  the  use  of  genetic  information  and  professional  and  public  education.  The  program 
gives  strong  emphasis  to  understanding  the  ethnic,  cultural,  social  and  psychological 
influences  that  must  inform  policy  development  and  service  delivery.  Initial  policy  options  for 
genetic  family  studies,  clinical  genetic  services,  and  health  care  coverage  have  been 
developed  and  reports  on  a  range  of  urgent  issues  are  expected  by  1995. 

As  the  genome  project  progresses,  the  need  to  prepare  for  broad  public  impact  becomes 
increasingly  important.  Policies  are  needed  to  anticipate  the  potential  consequences  of 
widespread  use  of  genetic  tests  for  common  conditions,  such  as  genetic  predisposition  to 
certain  cancers  or  genetic  susceptibility  to  certain  environmental  agents.  In  addition,  as  the 
genetic  elements  of  behavioral  and  other  non-disease  related  traits  are  better  understood, 
increased  educational  efforts  will  be  needed  to  prevent  stigmatization  or  discrimination  based 
on  these  traits.  Continued  emphasis  on  public  and  professional  education  at  all  levels  will  be 
critical  to  achieving  these  goals.  Mechanisms  for  developing  policy  options  that  build  on  the 
current  research  portfolio  and  actively  involve  the  public,  the  relevant  professions  and  the 
scientific  community  need  to  be  developed. 

Goals 

•  Continue  to  identify  and  define  issues  and  develop  policy  options  to  address  them 

•  Develop  and  disseminate  policy  options  regarding  genetic  testing  services  with 
widespread  potential  use 

•  Foster  greater  acceptance  of  human  genetic  variation 

•  Enhance  and  expand  public  and  professional  education  that  is  sensitive  to  sociocultural 
and  psychological  issues 

Training 

There  is  a  continuing  need  for  individuals  highly  trained  in  the  interdisciplinary  sciences 
related  to  genome  research.  The  original  goal  for  supporting  600  trainees  per  year  proved  to 
be  unattainable,  because  the  capacity  to  train  so  many  individuals  in  interdisciplinary  sciences 
did not exist. However, now that a number of genome centers have been established, it is
anticipated  that  training  programs  will  expand.  Although  no  numerical  goal  is  specified, 
expansion  of  training  activities  should  be  encouraged,  provided  standards  are  kept  high. 
Quality  is  more  important  than  quantity. 

Goal

•  Continue to encourage training of scientists in interdisciplinary sciences related to
genome research

Technology  Transfer 

Technology  transfer  is  already  occurring  to  a  remarkable  extent,  as  evidenced  by  the  number 
of  genome-related  companies  that  are  forming.  Many  interactions  and  collaborations  have 
been established between genome researchers and the private sector. In addition to the need
to transfer technology out of centers of genome research, there is also a need to increase the
transfer of technology from other fields into the genome centers. Increased cooperation with
industry,  as  well  as  continued  cooperation  between  the  agencies,  is  highly  desirable.  Care 
must  be  taken,  however,  to  avoid  conflicts  of  interest. 

Goal 

•  Encourage  and  enhance  technology  transfer  both  into  and  out  of  centers  of  genome 
research 

Outreach 

It  is  essential  to  the  success  of  the  Human  Genome  Project  that  the  products  of  genome 
research  be  made  available  to  the  community.  However,  only  a  subset  of  the  total 
information  is  likely  to  be  of  interest  at  any  one  time,  with  the  nature  of  the  subset  changing 
over  time.  Therefore,  it  is  desirable  to  have  flexible  distribution  systems  that  respond  quickly 
to  user  demand.  The  private  sector  is  best  suited  to  this  situation  and  has  begun  to  play  an 
active  and  highly  valued  role.  This  should  be  encouraged  and  facilitated  where  possible, 
including the provision of seed funding in some instances.

The  NIH  and  DOE  genome  programs  have  adopted  a  rule  for  sharing  of  information:  Newly 
developed  data  and  materials  are  to  be  released  within  6  months  of  their  creation.  This  policy 
has been well accepted. In many instances, information has been released before the end of
the  six  months. 

Goals

•  Cooperate  with  those  who  would  set  up  distribution  centers  for  genome  materials. 

•  Share  all  information  and  materials  within  6  months  of  their  development.  This  should 
be  accomplished  by  submission  to  public  databases  or  repositories,  or  both,  where 
appropriate. 

Conclusion 

To  date  the  Human  Genome  Project  has  experienced  gratifying  success.  However,  enormous 
challenges  remain.  The  technology  that  will  allow  the  sequencing  of  the  full  human  genome 
at  reasonable  cost  must  still  be  developed.  Major  support  of  research  in  this  area  is  essential 
if  the  genome  project  is  to  succeed  in  the  long  run.  The  new  goals  described  here  are 
designed  to  address  the  long-  and  short-term  needs  of  the  project. 

Although there is still debate about the need to sequence the entire genome, it is now more
widely recognized that DNA sequence will reveal a wealth of biological information that
could  not  be  obtained  in  other  ways.  The  sequence  so  far  obtained  from  model  organisms  has 
demonstrated  the  existence  of  a  large  number  of  genes  not  previously  suspected.  For 
example,  almost  half  the  open  reading  frames  identified  in  the  genomic  DNA  of  C.  elegans 
appear  to  represent  previously  unidentified  genes.  Similar  results  have  been  observed  in  both 
S.  cerevisiae  and  E.  coli  genomic  DNA.  Comparative  sequence  analysis  has  also  confirmed 
the  high  degree  of  homology  between  genes  across  species.  It  is  clear  that  sequence 
information represents a rich source for future investigation. Thus, the Human Genome
Project  must  continue  to  pursue  its  ultimate  goal,  namely  to  obtain  the  complete  human 
DNA  sequence.  At  the  same  time,  it  is  necessary  to  assure  that  technologies  are  developed 
that will allow the full interpretation of the DNA sequence once it is available. In order to
increase  emphasis  on  this  area,  an  explicit  goal  related  to  gene  identification  has  been  added. 

The  genome  project  has  already  had  a  profound  impact  on  biomedical  research,  as  evidenced 
by  the  isolation  of  a  number  of  genes  associated  with  important  diseases,  such  as 
Huntington's  disease,  amyotrophic  lateral  sclerosis,  neurofibromatosis  types  1  and  2, 
myotonic  dystrophy,  and  fragile  X  syndrome.  Genes  that  confer  a  predisposition  to  common 
diseases  such  as  breast  cancer,  colon  cancer,  hypertension,  diabetes  and  Alzheimer's  disease 
have  also  been  localized  to  specific  chromosomal  regions.  All  these  discoveries  benefitted 
from  the  information,  resources  and  technologies  developed  by  human  genome  research.  As 
the  genome  project  proceeds,  many  more  exciting  developments  are  expected  including 
technology  for  studying  the  health  effects  of  environmental  agents,  the  ability  to  decipher  the 
genomes  of  many  other  organisms,  including  countless  microbes  important  to  agriculture  and 
the environment, as well as the identification of many more genes involved in disease. The
technology  and  data  produced  by  the  genome  project  will  provide  a  strong  stimulus  to  broad 
areas  of  biological  research  and  biotechnology.  Exciting  years  lie  ahead  as  the  Human 
Genome Project moves toward its second set of 5-year goals.

Legend for Figure 1 (not shown)


Graphic  overview  of  the  new  goals  for  the  human  genome.  A  2-5  centiMorgan  genetic  map 
is  expected  to  be  completed  by  1995  and  a  physical  map  with  STS  markers  every  100  kb  by 
1998.  Efficient  methods  for  gene  identification  need  to  be  developed  and  refined.  The  DNA 
sequencing  goal  of  50  Megabases  per  year  by  1998  includes  all  DNA,  both  human  and  model 
organisms,  and  assumes  an  exponential  increase  in  sequencing  capacity  over  time.  Other 
important goals involving model organisms are not shown here, but are described in the text.

Introduction

The  complete  set  of  instructions  for  making  an  organism  is  called  its  genome.  It 
contains  the  master  blueprint  for  all  cellular  structures  and  activities  for  the  lifetime  of 
the  cell  or  organism.  Found  in  every  nucleus  of  a  person's  many  trillions  of  cells,  the 
human  genome  consists  of  tightly  coiled  threads  of  deoxyribonucleic  acid  (DNA)  and 
associated  protein  molecules,  organized  into  structures  called  chromosomes  (Fig.  1). 


Fig.  1.  The  Human  Genome  at  Four  Levels  of  Detail.  Apart  from  reproductive  cells  (gametes)  and 
mature  red  blood  cells,  every  cell  in  the  human  body  contains  23  pairs  of  chromosomes,  each  a 
packet  of  compressed  and  entwined  DNA  (1,  2).  Each  strand  of  DNA  consists  of  repeating 
nucleotide  units  composed  of  a  phosphate  group,  a  sugar  (deoxyribose),  and  a  base  (guanine, 
cytosine, thymine, or adenine) (3). Ordinarily, DNA takes the form of a highly regular
double-stranded helix, the strands of which are linked by hydrogen bonds between guanine and cytosine
and  between  thymine  and  adenine.  Each  such  linkage  is  a  base  pair  (bp);  some  3  billion  bp 
constitute  the  human  genome.  The  specificity  of  these  base-pair  linkages  underlies  the  mechanism 
of DNA replication illustrated here. Each strand of the double helix serves as a template for the
synthesis of a new strand; the nucleotide sequence (i.e., linear order of bases) of each strand is
strictly  determined.  Each  new  double  helix  is  a  twin,  an  exact  replica,  of  its  parent.  (Figure  and 
caption text provided by the LBL Human Genome Center.)

If unwound and tied together, the strands of DNA would stretch more than 5 feet but
would  be  only  50  trillionths  of  an  inch  wide.  For  each  organism,  the  components  of  these 
slender  threads  encode  all  the  information  necessary  for  building  and  maintaining  life, 
from  simple  bacteria  to  remarkably  complex  human  beings.  Understanding  how  DNA 
performs  this  function  requires  some  knowledge  of  its  structure  and  organization. 
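
The length figure quoted above can be checked with a short calculation. The sketch below is only a rough estimate: the 3 billion bp genome size comes from the text, while the rise of 0.34 nanometers per base pair (the usual value for the common double-helix form) and the presence of two genome copies per cell are assumptions introduced here for illustration.

```python
# Rough estimate of the stretched-out length of the DNA in one human cell.
BP_PER_GENOME = 3_000_000_000   # "some 3 billion bp" (from the text)
COPIES_PER_CELL = 2             # assumption: two sets of chromosomes per cell
RISE_PER_BP_NM = 0.34           # assumption: helix rise per base pair, in nanometers
METERS_PER_FOOT = 0.3048


def dna_length_feet(bp, copies=COPIES_PER_CELL, rise_nm=RISE_PER_BP_NM):
    """Return the end-to-end length, in feet, of `copies` genomes of `bp` base pairs."""
    meters = bp * copies * rise_nm * 1e-9
    return meters / METERS_PER_FOOT


length_ft = dna_length_feet(BP_PER_GENOME)
print(f"Estimated length: {length_ft:.1f} feet")
```

Under these assumptions the estimate comes out between 6 and 7 feet, which is consistent with the statement that the strands would stretch more than 5 feet.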


Fig. 2. DNA Structure. The four nitrogenous bases of DNA are arranged along the sugar-phosphate
backbone in a particular order (the DNA sequence), encoding all genetic instructions for an
organism. Adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). The two
DNA strands are held together by weak bonds between the bases.

A gene is a segment of a DNA molecule (ranging from fewer than 1 thousand bases to several
million), located in a particular position on a specific chromosome, whose base sequence
contains the information necessary for protein synthesis.


DNA 

In  humans,  as  in  other  higher  organisms,  a  DNA  molecule  consists  of  two  strands  that 
wrap  around  each  other  to  resemble  a  twisted  ladder  whose  sides,  made  of  sugar  and 
phosphate  molecules,  are  connected  by  "rungs"  of  nitrogen-containing  chemicals  called 
bases.  Each  strand  is  a  linear  arrangement  of  repeating  similar  units  called  nucleotides, 
which  are  each  composed  of  one  sugar,  one  phosphate,  and  a  nitrogenous  base  (Fig. 
2). Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and
guanine (G). The particular order of the bases arranged along the sugar-phosphate
backbone is called the DNA sequence; the sequence specifies the exact genetic instructions
required to create a particular organism with its own unique traits.


The  two  DNA  strands  are  held  together 
by  weak  bonds  between  the  bases  on 
each  strand,  forming  base  pairs  (bp). 
Genome  size  is  usually  stated  as  the  total 
number  of  base  pairs;  the  human  genome 
contains  roughly  3  billion  bp  (Fig.  3). 


[Fig. 2 labels: phosphate molecule; deoxyribose sugar molecule; sugar-phosphate backbone.]


Each time a cell divides into two daughter
cells,  its  full  genome  is  duplicated;  for 
humans  and  other  complex  organisms, 
this  duplication  occurs  in  the  nucleus. 
During  cell  division  the  DNA  molecule 
unwinds  and  the  weak  bonds  between 
the  base  pairs  break,  allowing  the  strands 
to  separate.  Each  strand  directs  the 
synthesis  of  a  complementary  new 
strand,  with  free  nucleotides  matching  up 
with  their  complementary  bases  on  each 
of the separated strands. Strict base-pairing
rules are adhered to: adenine will
pair only with thymine (an A-T pair) and
cytosine  with  guanine  (a  C-G  pair).  Each 
daughter  cell  receives  one  old  and  one 
new  DNA  strand  (Figs.  1  and  4).  The 
cell's  adherence  to  these  base-pairing 
rules  ensures  that  the  new  strand  is  an 
exact  copy  of  the  old  one.  This  minimizes 
the  incidence  of  errors  (mutations)  that 
may  greatly  affect  the  resulting  organism 
or  its  offspring. 
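
The base-pairing rules described above can be illustrated with a small sketch. It is a simplified model, not a description of the cellular machinery: strands are written as plain strings aligned base-for-base, strand direction is ignored, and the function names and the example sequence are hypothetical.

```python
# Base-pairing rules from the text: A pairs only with T, and C only with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(template):
    """Return the strand that pairs base-for-base with `template`."""
    return "".join(COMPLEMENT[base] for base in template)


def replicate(duplex):
    """Copy a two-strand molecule; each daughter keeps one old strand and gains one new strand."""
    old_top, old_bottom = duplex
    daughter_1 = (old_top, complement_strand(old_top))        # old top + new bottom
    daughter_2 = (complement_strand(old_bottom), old_bottom)  # new top + old bottom
    return daughter_1, daughter_2


# Hypothetical 8-base example sequence.
parent = ("ATGCCGTA", complement_strand("ATGCCGTA"))
daughter_1, daughter_2 = replicate(parent)
print(parent)
print(daughter_1 == parent, daughter_2 == parent)
```

Because each new strand is dictated entirely by the old strand it pairs with, both daughter molecules come out identical to the parent, which is the point the text makes about exact copying.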
Genes

Each  DNA  molecule  contains  many  genes — the  basic  physical  and  functional  units  of 
heredity.  A  gene  is  a  specific  sequence  of  nucleotide  bases,  whose  sequences  carry  the 
information  required  for  constructing  proteins,  which  provide  the  structural  components  of 
cells  and  tissues  as  well  as  enzymes  for  essential  biochemical  reactions.  The  human 
genome  is  estimated  to  comprise  at  least  100,000  genes. 

Human  genes  vary  widely  in  length,  often  extending  over  thousands  of  bases,  but  only 
about  10%  of  the  genome  is  known  to  include  the  protein-coding  sequences  (exons)  of 
genes.  Interspersed  within  many  genes  are  intron  sequences,  which  have  no  coding 
function.  The  balance  of  the  genome  is  thought  to  consist  of  other  noncoding  regions 
(such  as  control  sequences  and  intergenic  regions),  whose  functions  are  obscure.  All 
living  organisms  are  composed  largely  of  proteins;  humans  can  synthesize  at  least 
100,000  different  kinds.  Proteins  are  large,  complex  molecules  made  up  of  long  chains  of 
subunits  called  amino  acids.  Twenty  different  kinds  of  amino  acids  are  usually  found  in 
proteins.  Within  the  gene,  each  specific  sequence  of  three  DNA  bases  (codons)  directs 
the  cell's  protein-synthesizing  machinery  to  add  specific  amino  acids.  For  example,  the 
base  sequence  ATG  codes  for  the  amino  acid  methionine.  Since  3  bases  code  for 
1  amino  acid,  the  protein  coded  by  an  average-sized  gene  (3000  bp)  will  contain  1000 
amino  acids.  The  genetic  code  is  thus  a  series  of  codons  that  specify  which  amino  acids 
are  required  to  make  up  specific  proteins. 
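
The codon arithmetic above can be sketched in a few lines. The table below is deliberately partial: it lists only a handful of codons from the standard genetic code, written as DNA bases as in the text, and the example sequence and function names are hypothetical.

```python
# A few entries of the standard genetic code, written as DNA codons.
# This is a partial table for illustration only.
CODON_TABLE = {
    "ATG": "Met",  # methionine, the example given in the text
    "TTT": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
}


def translate(coding_dna):
    """Read the sequence three bases at a time and return the list of amino acids."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


def amino_acid_count(coding_length_bp):
    """Three bases code for one amino acid."""
    return coding_length_bp // 3


print(translate("ATGTTTGGCAAATGGTAA"))
print(amino_acid_count(3000))
```

The second call reproduces the figure in the text: 3000 bases of coding sequence correspond to 1000 amino acids.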

Comparative Sequence Sizes                          Bases

• Largest known continuous DNA sequence
  (yeast chromosome 3)                              350 Thousand
• Escherichia coli (bacterium) genome               4.6 Million
• Largest yeast chromosome now mapped               5.8 Million
• Entire yeast genome                               15 Million
• Smallest human chromosome (Y)                     50 Million
• Largest human chromosome (1)                      250 Million
• Entire human genome                               3 Billion

Fig. 3. Comparison of Largest Known DNA Sequence with Approximate Chromosome and
Genome Sizes of Model Organisms and Humans. A major focus of the Human Genome Project
is the development of sequencing schemes that are faster and more economical.

The protein-coding instructions from the genes are transmitted indirectly through messenger
ribonucleic acid (mRNA), a transient intermediary molecule similar to a single strand
of DNA. For the information within a gene to be expressed, a complementary RNA strand
is produced (a process called transcription) from the DNA template in the nucleus. This
mRNA is moved from the nucleus to the cellular cytoplasm, where it serves as the template
for protein synthesis. The cell's protein-synthesizing machinery then translates the
codons into a string of amino acids that will constitute the protein molecule for which it
codes (Fig. 5). In the laboratory, the mRNA molecule can be isolated and used as a
template to synthesize a complementary DNA (cDNA) strand, which can then be used to
locate the corresponding genes on a chromosome map. The utility of this strategy is
described in the section on physical mapping.
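
Transcription follows the same pairing logic as replication, except that RNA carries uracil (U) where DNA carries thymine. The sketch below is a minimal illustration under simplifying assumptions: the template is a short made-up sequence, strand direction is ignored, and the function name is hypothetical.

```python
# Pairing used when an RNA strand is built on a DNA template:
# A on the template pairs with U in the RNA; T with A; C with G; G with C.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna):
    """Return the RNA strand complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


# Hypothetical 9-base template.
mrna = transcribe("TACAAACCG")
print(mrna)
```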


Chromosomes 

The  3  billion  bp  in  the  human  genome  are  organized  into  24  distinct,  physically  separate 
microscopic units called chromosomes. All genes are arranged linearly along the chromosomes.
The nucleus of most human cells contains 2 sets of chromosomes, 1 set given by
each  parent.  Each  set  has  23  single  chromosomes — 22  autosomes  and  an  X  or  Y  sex 
chromosome. (A normal female will have a pair of X chromosomes; a male will have an X
and Y pair.) Chromosomes contain roughly equal parts of protein and DNA; chromosomal
DNA contains an average of 150 million bases. DNA molecules are among the largest
molecules now known.

Fig. 4. DNA Replication. During replication the DNA molecule unwinds, with each single strand
becoming a template for synthesis of a new, complementary strand. Each daughter molecule,
consisting of one old and one new DNA strand, is an exact copy of the parent molecule. [Source:
adapted from Mapping Our Genes - The Genome Projects: How Big, How Fast? U.S. Congress,
Office of Technology Assessment, OTA-BA-373 (Washington, D.C.: U.S. Government Printing
Office, 1988).]

Chromosomes  can  be  seen  under  a  light  microscope  and,  when  stained  with  certain  dyes, 
reveal  a  pattern  of  light  and  dark  bands  reflecting  regional  variations  in  the  amounts  of  A 
and  T  vs  G  and  C.  Differences  in  size  and  banding  pattern  allow  the  24  chromosomes  to 
be  distinguished  from  each  other,  an  analysis  called  a  karyotype.  A  few  types  of  major 
chromosomal  abnormalities,  including  missing  or  extra  copies  of  a  chromosome  or  gross 
breaks  and  rejoinings  (translocations),  can  be  detected  by  microscopic  examination; 
Down's  syndrome,  in  which  an  individual's  cells  contain  a  third  copy  of  chromosome  21,  is 
diagnosed by karyotype analysis (Fig. 6). Most changes in DNA, however, are too subtle to
be detected by this technique and require molecular analysis. These subtle DNA abnormalities
(mutations) are responsible for many inherited diseases such as cystic fibrosis and
sickle  cell  anemia  or  may  predispose  an  individual  to  cancer,  major  psychiatric  illnesses, 
and other complex diseases.

Fig. 5. Gene Expression. When genes are expressed, the genetic information (base sequence) on DNA is first transcribed
(copied) to a molecule of messenger RNA in a process similar to DNA replication. The mRNA molecules then leave the cell
nucleus and enter the cytoplasm, where triplets of bases (codons) forming the genetic code specify the particular amino acids that
make up an individual protein. This process, called translation, is accomplished by ribosomes (cellular components composed of
proteins and another class of RNA) that read the genetic code from the mRNA, and transfer RNAs (tRNAs) that transport amino
acids to the ribosomes for attachment to the growing protein. (Source: see Fig. 4.)

Fig. 6. Karyotype. Microscopic examination of chromosome size and banding patterns allows
medical laboratories to identify and arrange each of the 24 different chromosomes (22 pairs of
autosomes and one pair of sex chromosomes) into a karyotype, which then serves as a tool in the
diagnosis of genetic diseases. The extra copy of chromosome 21 in this karyotype identifies this
individual as having Down's syndrome.


Mapping and Sequencing the Human Genome

A primary goal of the Human Genome Project is to make a series of descriptive diagrams
(maps) of each human chromosome at increasingly finer resolutions. Mapping
involves (1) dividing the chromosomes into smaller fragments that can be propagated and
characterized and (2) ordering (mapping) them to correspond to their respective locations
on the chromosomes. After mapping is completed, the next step is to determine the base
sequence of each of the ordered DNA fragments. The ultimate goal of genome research is
to find all the genes in the DNA sequence and to develop tools for using this information in
the study of human biology and medicine. Improving the instrumentation and techniques
required for mapping and sequencing, a major focus of the genome project, will
increase efficiency and cost-effectiveness. Goals include automating methods and optimizing
techniques to extract the maximum useful information from maps and sequences.

A  genome  map  describes  the  order  of  genes  or  other  markers  and  the  spacing  between 
them  on  each  chromosome.  Human  genome  maps  are  constructed  on  several  different 
scales or levels of resolution. At the coarsest resolution are genetic linkage maps, which
depict the relative chromosomal locations of DNA markers (genes and other identifiable
DNA  sequences)  by  their  patterns  of  inheritance.  Physical  maps  describe  the  chemical 
characteristics  of  the  DNA  molecule  itself. 


Geneticists have already charted the approximate positions of over 2300 genes, and a
start has been made in establishing high-resolution maps of the genome (Fig. 7). More
precise maps are needed to organize systematic sequencing efforts and plan new
research  directions. 


Mapping  Strategies 

Genetic Linkage Maps

A  genetic  linkage  map  shows  the  relative  locations  of  specific  DNA  markers  along  the 
chromosome. Any inherited physical or molecular characteristic that differs among individuals
and is easily detectable in the laboratory is a potential genetic marker. Markers
can  be  expressed  DNA  regions  (genes)  or  DNA  segments  that  have  no  known  coding 
function  but  whose  inheritance  pattern  can  be  followed.  DNA  sequence  differences  are 
especially useful markers because they are plentiful and easy to characterize precisely.

Fig. 7. Assignment of Genes to Specific Chromosomes. The number of genes assigned
(mapped) to specific chromosomes has greatly increased since the first autosomal (i.e., not on the
X or Y chromosome) marker was mapped in 1968. Most of these genes have been mapped to
specific bands on chromosomes. The acceleration of chromosome assignments is due to (1) a
combination of improved and new techniques in chromosome sorting and band analysis, (2) data from
family studies, and (3) the introduction of recombinant DNA technology. [Source: adapted from
Victor A. McKusick, "Current Trends in Mapping Human Genes," The FASEB Journal 5(1),
12 (1991).]

Markers must be polymorphic to be useful in mapping; that is, alternative forms must exist
among individuals so that they are detectable among different members in family studies.
Polymorphisms are variations in DNA sequence that occur on average once every 300 to
500 bp. Variations within exon sequences can lead to observable changes, such as differences
in eye color, blood type, and disease susceptibility. Most variations occur within
introns  and  have  little  or  no  effect  on  an  organism's  appearance  or  function,  yet  they  are 
detectable  at  the  DNA  level  and  can  be  used  as  markers.  Examples  of  these  types  of 
markers  include  (1)  restriction  fragment  length  polymorphisms  (RFLPs),  which  reflect 
sequence  variations  in  DNA  sites  that  can  be  cleaved  by  DNA  restriction  enzymes  (see 
box),  and  (2)  variable  number  of  tandem  repeat  sequences,  which  are  short  repeated 
sequences that vary in the number of repeated units and, therefore, in length (a characteristic
easily measured). The human genetic linkage map is constructed by observing how
frequently  two  markers  are  inherited  together. 
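
How a restriction fragment length polymorphism shows up in the laboratory can be sketched with a toy digest. The recognition sequence GAATTC (cut after the first G) is that of the enzyme EcoRI; everything else here, including the two short allele sequences and the function names, is made up for illustration.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut `dna` at every occurrence of `site` and return the fragments in order."""
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])
    return fragments


def fragment_lengths(dna):
    """Fragment sizes are what is measured when fragments are separated by size."""
    return [len(fragment) for fragment in digest(dna)]


# Two hypothetical versions of the same region. In allele_b a single base change
# (GAATTC -> GAGTTC) has destroyed the second cutting site.
allele_a = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TTTT"
allele_b = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAGTTC" + "TTTT"

print(fragment_lengths(allele_a))
print(fragment_lengths(allele_b))
```

The two alleles have the same total length but yield different sets of fragment sizes, and that difference is the detectable marker that can be followed through a family.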

Two  markers  located  near  each  other  on  the  same  chromosome  will  tend  to  be  passed 
together  from  parent  to  child.  During  the  normal  production  of  sperm  and  egg  cells,  DNA 
strands  occasionally  break  and  rejoin  in  different  places  on  the  same  chromosome  or  on 
the  other  copy  of  the  same  chromosome  (i.e.,  the  homologous  chromosome).  This  process 
(called  meiotic  recombination)  can  result  in  the  separation  of  two  markers  originally  on  the 
same chromosome (Fig. 8). The closer the markers are to each other (the more "tightly
linked"), the less likely a recombination event will fall between and separate them.
Recombination frequency thus provides an estimate of the distance between two markers.

On  the  genetic  map,  distances  between  markers  are  measured  in  terms  of  centimorgans 
(cM),  named  after  the  American  geneticist  Thomas  Hunt  Morgan.  Two  markers  are  said  to 
be  1  cM  apart  if  they  are  separated  by  recombination  1%  of  the  time.  A  genetic  distance  of 
1  cM  is  roughly  equal  to  a  physical  distance  of  1  million  bp  (1  Mb).  The  current  resolution 
of  most  human  genetic  map  regions  is  about  10  Mb. 
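
The definition of the centimorgan lends itself to a short worked example. The sketch below uses hypothetical counts from an imagined family study; the conversion of 1 cM to roughly 1 million bp is the rule of thumb given in the text, and the simple proportionality is only reasonable for markers that lie fairly close together.

```python
BP_PER_CM = 1_000_000  # rule of thumb from the text: 1 cM is roughly 1 Mb


def map_distance_cm(recombinant_offspring, total_offspring):
    """Two markers separated by recombination 1% of the time are 1 cM apart."""
    return 100.0 * recombinant_offspring / total_offspring


def approx_physical_distance_bp(distance_cm, bp_per_cm=BP_PER_CM):
    """Very rough conversion from genetic distance to physical distance."""
    return distance_cm * bp_per_cm


# Hypothetical data: 3 of 100 offspring show the two markers separated.
distance = map_distance_cm(3, 100)
print(distance, "cM")
print(approx_physical_distance_bp(distance), "bp (approximate)")
```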

The  value  of  the  genetic  map  is  that  an  inherited  disease  can  be  located  on  the  map  by 
following  the  inheritance  of  a  DNA  marker  present  in  affected  individuals  (but  absent  in 
unaffected  individuals),  even  though  the  molecular  basis  of  the  disease  may  not  yet  be 
understood nor the responsible gene identified. Genetic maps have been used to find the
exact chromosomal location of several important disease genes, including cystic fibrosis,
sickle cell disease, Tay-Sachs disease, fragile X syndrome, and myotonic dystrophy.


Human Genome Project Goals                      Resolution

• Complete a detailed human genetic map         2 Mb
• Complete a physical map                       0.1 Mb
• Acquire the genome as clones                  5 kb
• Determine the complete sequence               1 bp
• Find all the genes


With  the  data  generated  by  the  project,  investigators 
will  determine  the  functions  of  the  genes  and  develop 
tools  for  biological  and  medical  applications. 


One short-term goal of the genome project is to develop a high-resolution genetic map (2 to
5 cM); recent consensus maps of some chromosomes have averaged 7 to 10 cM between
genetic markers. Genetic mapping resolution has been increased through the application of
recombinant DNA technology, including in vitro radiation-induced chromosome fragmentation
and cell fusions (joining human cells with those of other species to form hybrid cells) to create
panels of cells with specific and varied human chromosomal components. Assessing the
frequency of marker sites remaining together after radiation-induced DNA fragmentation can
establish the order and distance between the markers. Because only a single copy of a
chromosome is required for analysis, even nonpolymorphic markers are useful in radiation
hybrid mapping. [In meiotic mapping (described above), two copies of a chromosome must be
distinguished from each other by polymorphic markers.]

Fig. 8. Constructing a Genetic Linkage Map. Genetic linkage maps of each chromosome are
made by determining how frequently two markers are passed together from parent to child.
Because genetic material is sometimes exchanged during the production of sperm and egg cells,
groups of traits (or markers) originally together on one chromosome may not be inherited together.
Closely linked markers are less likely to be separated by spontaneous chromosome rearrangements.
In this diagram, the vertical lines represent chromosome 4 pairs for each individual in a family.
The father has two traits that can be detected in any child who inherits them: a short known DNA
sequence used as a genetic marker (M) and Huntington's disease (HD). The fact that one
child received only a single trait (M) from that particular chromosome indicates that the father's
genetic material recombined during the process of sperm production. The frequency of this event
helps determine the distance between the two DNA sequences on a genetic map.
[Figure label: Recombinant: Frequency of this event reflects the distance between genes for the
marker M and HD.]

Physical  Maps 

Different  types  of  physical  maps  vary  in  their  degree  of  resolution.  The  lowest-resolution 
physical  map  is  the  chromosomal  (sometimes  called  cytogenetic)  map,  which  is  based  on 
the  distinctive  banding  patterns  observed  by  light  microscopy  of  stained  chromosomes.  A 
cDNA  map  shows  the  locations  of  expressed  DNA  regions  (exons)  on  the  chromosomal 
map. The more detailed cosmid contig map depicts the order of overlapping DNA fragments
spanning the genome. A macrorestriction map describes the order and distance
between  enzyme  cutting  (cleavage)  sites.  The  highest-resolution  physical  map  is  the 
complete  elucidation  of  the  DNA  base-pair  sequence  of  each  chromosome  in  the  human 
genome. Physical maps are described in greater detail below.

Low-Resolution Physical Mapping

Chromosomal map. In a chromosomal map, genes or other identifiable DNA fragments
are assigned to their respective chromosomes, with distances measured in base pairs.
These markers can be physically associated with particular bands (identified by cytogenetic
staining) primarily by in situ hybridization, a technique that involves tagging the DNA
marker with an observable label (e.g., one that fluoresces or is radioactive). The location
of the labeled probe can be detected after it binds to its complementary DNA strand in an
intact  chromosome. 


As  with  genetic  linkage  mapping,  chromosomal  mapping  can  be  used  to  locate  genetic 
markers  defined  by  traits  observable  only  in  whole  organisms.  Because  chromosomal 
maps  are  based  on  estimates  of  physical  distance,  they  are  considered  to  be  physical 
maps.  The  number  of  base  pairs  within  a  band  can  only  be  estimated. 

Until  recently,  even  the  best  chromosomal  maps  could  be  used  to  locate  a  DNA  fragment 
only  to  a  region  of  about  10  Mb,  the  size  of  a  typical  band  seen  on  a  chromosome. 
Improvements  in  fluorescence  in  situ  hybridization  (FISH)  methods  allow  orientation  of 
DNA sequences that lie as close as 2 to 5 Mb. Modifications to in situ hybridization
methods,  using  chromosomes  at  a  stage  in  cell  division  (interphase)  when  they  are  less 
compact,  increase  map  resolution  to  around  100,000  bp.  Further  banding  refinement 
might  allow  chromosomal  bands  to  be  associated  with  specific  amplified  DNA  fragments, 
an  improvement  that  could  be  useful  in  analyzing  observable  physical  traits  associated 
with  chromosomal  abnormalities. 

cDNA  map.  A  cDNA  map  shows  the  positions  of  expressed  DNA  regions  (exons) 
relative  to  particular  chromosomal  regions  or  bands.  (Expressed  DNA  regions  are  those 
transcribed  into  mRNA.)  cDNA  is  synthesized  in  the  laboratory  using  the  mRNA  molecule 
as  a  template;  base-pairing  rules  are  followed  (i.e.,  an  A  on  the  mRNA  molecule  will  pair 
with  a  T  on  the  new  DNA  strand).  This  cDNA  can  then  be  mapped  to  genomic  regions. 
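
The synthesis and placement of a cDNA can be sketched as follows. This is a deliberately simplified picture: sequences are short made-up strings, strand direction is ignored, the gene is assumed to contain no introns so that the cDNA matches the genomic DNA exactly, and the function names are hypothetical.

```python
# Pairing used when DNA is built on an mRNA template:
# an A on the mRNA pairs with a T on the new DNA strand, and U pairs with A.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def synthesize_cdna(mrna):
    """Return the DNA strand complementary to the mRNA template."""
    return "".join(RNA_TO_DNA[base] for base in mrna)


def map_cdna(cdna, genomic_strand):
    """Return the 0-based position of the cDNA in the genomic strand, or -1 if not found."""
    return genomic_strand.find(cdna)


# Hypothetical example: a 9-base message and a 21-base stretch of genomic DNA.
cdna = synthesize_cdna("AUGUUUGGC")
genomic_strand = "GGATCC" + "TACAAACCG" + "TTAGGA"
position = map_cdna(cdna, genomic_strand)
print(cdna, "found at position", position)
```

In practice the match is detected by hybridization rather than by exact string comparison, so this search stands in for the laboratory step only in spirit.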

Because  they  represent  expressed  genomic  regions,  cDNAs  are  thought  to  identify  the 
parts  of  the  genome  with  the  most  biological  and  medical  significance.  A  cDNA  map  can 
provide  the  chromosomal  location  for  genes  whose  functions  are  currently  unknown.  For 
disease-gene  hunters,  the  map  can  also  suggest  a  set  of  candidate  genes  to  test  when 
the approximate location of a disease gene has been mapped by genetic linkage techniques.

High-Resolution  Physical  Mapping 

The  two  current  approaches  to  high-resolution  physical  mapping  are  termed  "top-down" 
(producing a macrorestriction map) and "bottom-up" (resulting in a contig map). With
either strategy (described below) the maps represent ordered sets of DNA fragments that
are generated by cutting genomic DNA with restriction enzymes (see Restriction Enzymes
box at right). The fragments are then amplified by cloning or by polymerase chain
reaction (PCR) methods (see DNA Amplification). Electrophoretic techniques are used to
separate the fragments according to size into different bands, which can be visualized by
direct DNA staining or by hybridization with DNA probes of interest. The use of purified
chromosomes  separated  either  by  flow  sorting  from  human  cell  lines  or  in  hybrid  cell  lines 
allows  a  single  chromosome  to  be  mapped  (see  Separating  Chromosomes  box  at  right). 

A number of strategies can be used to reconstruct the original order of the DNA fragments
in the genome. Many approaches make use of the ability of single strands of DNA and/or
RNA to hybridize, that is, to form double-stranded segments by hydrogen bonding between
complementary bases. The extent of sequence homology between the two strands can be
inferred from the length of the double-stranded segment. Fingerprinting uses restriction
map data to determine which fragments have a specific sequence (fingerprint) in common
and therefore overlap. Another approach uses linking clones as probes for hybridization to
chromosomal DNA cut with the same restriction enzyme.


Restriction Enzymes: Microscopic Scalpels

Isolated from various bacteria, restriction enzymes recognize short DNA sequences
and cut the DNA molecules at those specific sites. (A natural biological function of
these enzymes is to protect bacteria by attacking viral and other foreign DNA.) Some
restriction enzymes (rare-cutters) cut the DNA very infrequently, generating a small
number of very large fragments (several thousand to a million bp). Most enzymes cut
DNA more frequently, thus generating a large number of small fragments (less than a
hundred to more than a thousand bp).

On average, restriction enzymes with

• 4-base recognition sites will yield pieces 256 bases long,

• 6-base recognition sites will yield pieces 4000 bases long, and

• 8-base recognition sites will yield pieces 64,000 bases long.

Since hundreds of different restriction enzymes have been characterized, DNA can
be cut into many different small fragments.


Separating Chromosomes

Flow sorting

Pioneered at Los Alamos National Laboratory (LANL), flow sorting employs flow
cytometry to separate, according to size, chromosomes isolated from cells during
cell division when they are condensed and stable. As the chromosomes flow singly
past a laser beam, they are differentiated by analyzing the amount of DNA present,
and individual chromosomes are directed to specific collection tubes.

Somatic cell hybridization

In somatic cell hybridization, human cells and rodent tumor cells are fused (hybridized);
over time, after the chromosomes mix, human chromosomes are preferentially
lost from the hybrid cell until only one or a few remain. Those individual hybrid cells
are then propagated and maintained as cell lines containing specific human chromosomes.
Improvements to this technique have generated a number of hybrid cell
lines, each with a specific single human chromosome.
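
The average fragment lengths quoted in the Restriction Enzymes box follow from simple probability. The sketch below assumes, as a simplification not stated in the text, that the four bases are equally frequent and independent, so an n-base recognition site is expected once every 4^n bases. The `digest` helper is hypothetical and cuts at the start of each site, whereas real enzymes cut at a defined position within or near the site.

```python
# Illustrative sketch, assuming equal and independent base frequencies.
def expected_fragment_length(site_length: int) -> int:
    """Average spacing between occurrences of an n-base recognition site."""
    return 4 ** site_length


def digest(sequence: str, site: str) -> list[str]:
    """Split a sequence at every occurrence of a recognition site.

    Simplification: the cut is placed at the start of each site.
    """
    fragments, start = [], 0
    position = sequence.find(site)
    while position != -1:
        fragments.append(sequence[start:position])
        start = position
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return [fragment for fragment in fragments if fragment]


for n in (4, 6, 8):
    print(n, expected_fragment_length(n))

print(digest("AAGAATTCTTGAATTCGG", "GAATTC"))
```

The computed values are 256, 4096, and 65,536 bases; the box rounds the last two to 4000 and 64,000.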

Macrorestriction  maps:  Top-down  mapping.  In  top-down  mapping,  a  single 
chromosome  is  cut  (with  rare-cutter  restriction  enzymes)  into  large  pieces,  which  are 
ordered  and  subdivided;  the  smaller  pieces  are  then  mapped  further.  The  resulting  macro- 
restriction  maps  depict  the  order  of  and  distance  between  sites  at  which  rare-cutter 
enzymes  cleave  (Fig.  9a).  This  approach  yields  maps  with  more  continuity  and  fewer  gaps 
between  fragments  than  contig  maps  (see  below),  but  map  resolution  is  lower  and  may 
not  be  useful  in  finding  particular  genes;  in  addition,  this  strategy  generally  does  not 
produce  long  stretches  of  mapped  sites.  Currently,  this  approach  allows  DNA  pieces  to  be
located  in  regions  measuring  about  100,000  bp  to  1  Mb. 

The  development  of  pulsed-field  gel  (PFG)  electrophoretic  methods  has  improved  the
mapping  and  cloning  of  large  DNA  molecules.  While  conventional  gel  electrophoretic
methods  separate  pieces  less  than  40  kb  (1  kb  =  1000  bases)  in  size,  PFG  separates
molecules  up  to  10  Mb,  allowing  the  application  of  both  conventional  and  new  mapping
methods  to  larger  genomic  regions. 


[Figure 9 panels: (a) Top down: chromosome to macrorestriction map, complete but low
resolution. (b) Bottom up: arrayed library to linked library, detailed but incomplete.]

Fig. 9. Physical Mapping Strategies. Top-down physical mapping (a) produces maps with few gaps, but map resolution may not
allow location of specific genes. Bottom-up strategies (b) generate extremely detailed maps of small areas but leave many gaps.
A combination of both approaches is being used. [Source: Adapted from P. R. Billings et al., "New Techniques for Physical
Mapping of the Human Genome," The FASEB Journal 5(1), 29 (1991).]

Contig maps: Bottom-up mapping. The bottom-up approach involves cutting the
chromosome into small pieces, each of which is cloned and ordered. The ordered fragments
form contiguous DNA blocks (contigs). Currently, the resulting "library" of clones
varies in size from 10,000 bp to 1 Mb (Fig. 9b). An advantage of this approach is the
accessibility of these stable clones to other researchers. Contig construction can be
verified by FISH, which localizes cosmids to specific regions within chromosomal bands.

Contig  maps  thus  consist  of  a  linked  library  of  small  overlapping  clones  representing  a 
complete  chromosomal  segment.  While  useful  for  finding  genes  localized  to  a  small  area 
(under  2  Mb),  contig  maps  are  difficult  to  extend  over  large  stretches  of  a  chromosome 
because  not  all  regions  are  clonable.  DNA  probe  techniques  can  be  used  to  fill  in  the
gaps,  but  they  are  time  consuming.  Figure  10  is  a  diagram  relating  the  different  types  of 
maps. 
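
The way overlapping clones fall into contigs, and the way unclonable regions leave gaps between them, can be sketched as follows. This is a toy illustration under stated assumptions, not a description of any laboratory's software: each clone is represented by a hypothetical "fingerprint" (a set of restriction fragment sizes), and two clones are taken to overlap when they share at least `min_shared` fragment sizes.

```python
# Toy sketch: group clones into contigs by shared fingerprint fragments.
# The clone names, fragment sizes, and min_shared threshold are hypothetical.
from itertools import combinations


def build_contigs(fingerprints: dict[str, set[int]], min_shared: int = 2) -> list[list[str]]:
    """Return contigs as sorted lists of clone names linked by overlaps."""
    parent = {name: name for name in fingerprints}

    def find(name: str) -> str:
        # Follow parent links to the representative clone of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Join every pair of clones whose fingerprints share enough fragments.
    for a, b in combinations(fingerprints, 2):
        if len(fingerprints[a] & fingerprints[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


clones = {
    "c1": {1200, 3400, 560},
    "c2": {3400, 560, 7800, 910},
    "c3": {7800, 910, 2500},
    "c4": {4100, 660},
}
print(build_contigs(clones))
```

In this example c1, c2, and c3 chain together into one contig, while c4 shares no fragments with the others and stands alone, which corresponds to a gap in the map.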


Technological  improvements  now  make  possible  the  cloning  of  large  DNA  pieces,  using 
artificially  constructed  chromosome  vectors  that  carry  human  DNA  fragments  as  large  as 
1  Mb.  These  vectors  are  maintained  in  yeast  cells  as  artificial  chromosomes  (YACs).  (For 
more  explanation,  see  DNA  Amplification.)  Before  YACs  were  developed,  the  largest 
cloning  vectors  (cosmids)  carried  inserts  of  only  20  to  40  kb.  YAC  methodology  drastically 
reduces  the  number  of  clones  to  be  ordered;  many  YACs  span  entire  human  genes.  A 
more  detailed  map  of  a  large  YAC  insert  can  be  produced  by  subcloning,  a  process  in 
which  fragments  of  the  original  insert  are  cloned  into  smaller-insert  vectors.  Because 
some  YAC  regions  are  unstable,  large-capacity  bacterial  vectors  (i.e.,  those  that  can 
accommodate  large  inserts)  are  also  being  developed.

[Figure 10 panels: genetic map (gene or polymorphism), restriction map, ordered library, sequence.]

Fig. 10. Types of Genome Maps. At the coarsest resolution, the genetic map measures
recombination frequency between linked markers (genes or polymorphisms). At the next
resolution level, restriction fragments of 1 to 2 Mb can be separated and mapped. Ordered
libraries of cosmids and YACs have insert sizes from 40 to 400 kb. The base sequence is the
ultimate physical map. Chromosomal mapping (not shown) locates genetic sites in relation to
bands on chromosomes (estimated resolution of 5 Mb); new in situ hybridization techniques
can place loci 100 kb apart. These direct strategies link the other four mapping approaches
diagramed here. [Source: see Fig. 9.]


Sequencing Technologies


The ultimate physical map of the human genome is the complete DNA sequence: the
determination of all base pairs on each chromosome. The completed map will provide
biologists with a Rosetta stone for studying human biology and enable medical researchers
to begin to unravel the mechanisms of inherited diseases. Much effort continues to be
spent locating genes; if the full sequence were known, emphasis could shift to determining
gene function. The Human Genome Project is creating research tools for 21st-century
biology, when the goal will be to understand the sequence and functions of the genes
residing therein.

Achieving the goals of the Human Genome Project will require substantial improvements
in the rate, efficiency, and reliability of standard sequencing procedures. While technological
advances are leading to the automation of standard DNA purification, separation, and
detection steps, efforts are also focusing on the development of entirely new sequencing
methods that may eliminate some of these steps. Sequencing procedures currently
involve first subcloning DNA fragments from a cosmid or bacteriophage library into special
sequencing vectors that carry shorter pieces of the original cosmid fragments (Fig. 11).
The next step is to make the subcloned fragments into sets of nested fragments differing
in length by one nucleotide, so that the specific base at the end of each successive
fragment is detectable after the fragments have been separated by gel electrophoresis.
Current sequencing technologies are discussed later.

PARTIAL NUCLEOTIDE SEQUENCE
(from human β-globin gene)

GGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCC
TGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGG


Fig. 11. Constructing Clones for Sequencing. Cloned DNA molecules must be made
progressively smaller and the fragments subcloned into new vectors to obtain fragments small
enough for use with current sequencing technology. Sequencing results are compiled to provide
longer stretches of sequence across a chromosome. (Source: adapted from David A. Micklos and
Greg A. Freyer, DNA Science, A First Course in Recombinant DNA Technology, Burlington, N.C.:
Carolina Biological Supply Company, 1990.)


DNA Amplification: Cloning and Polymerase Chain Reaction (PCR)

Cloning (in vivo DNA amplification)

Cloning involves the use of recombinant DNA technology to propagate DNA fragments inside a
foreign host. The fragments are usually isolated from chromosomes using restriction enzymes
and then united with a carrier (a vector). Following introduction into suitable host cells, the DNA
fragments can then be reproduced along with the host cell DNA. Vectors are DNA molecules
originating from viruses, bacteria, and yeast cells. They accommodate various sizes of
foreign DNA fragments ranging from 12,000 bp for bacterial vectors (plasmids and cosmids) to
1 Mb for yeast vectors (yeast artificial chromosomes). Bacteria are most often the hosts for
these inserts, but yeast and mammalian cells are also used (a).

Cloning procedures provide unlimited material for experimental study. A random (unordered) set of
cloned DNA fragments is called a library. Genomic libraries are sets of overlapping fragments
encompassing an entire genome (b). Also available are chromosome-specific libraries,
which consist of fragments derived from source DNA enriched for a particular chromosome. (See
Separating Chromosomes box.)

(a) [...] replicating circular molecule of DNA) that is separate from
chromosomal DNA. When the recombinant plasmid is introduced into bacteria, the newly
inserted segment will be replicated along with the rest of the plasmid.

(b) Constructing an Overlapping Clone Library. A collection of clones of chromosomal DNA,
called a library, has no obvious order indicating the original positions of the cloned pieces on
the uncut chromosome. To establish that two particular clones are adjacent to each other in the
genome, libraries of clones containing partly overlapping regions must be constructed. These
clone libraries are ordered by dividing the inserts into smaller fragments and determining
which clones share common
DNA sequences.

PCR (in vitro DNA amplification)

Described as being to genes what Gutenberg's printing press was to the written word, PCR can amplify a
desired DNA sequence of any origin (virus, bacteria, plant, or human) hundreds of millions of times in a
matter of hours, a task that would have required several days with recombinant technology. PCR is especially
valuable because the reaction is highly specific, easily automated, and capable of amplifying minute
amounts of sample. For these reasons, PCR has also had a major impact on clinical medicine, genetic
disease diagnostics, forensic science, and evolutionary biology.

PCR is a process based on a specialized polymerase enzyme, which can synthesize a complementary
strand to a given DNA strand in a mixture containing the 4 DNA bases and 2 DNA fragments (primers, each
about 20 bases long) flanking the target sequence. The mixture is heated to separate the strands of double-
stranded DNA containing the target sequence and then cooled to allow (1) the primers to find and bind to
their complementary sequences on the separated strands and (2) the polymerase to extend the primers into
new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially,
since each new double strand separates to become two templates for further synthesis. In about 1 hour, 20
PCR cycles can amplify the target by a millionfold.
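
The millionfold figure follows from repeated doubling. The sketch below assumes ideal conditions in which every cycle doubles the number of target copies; the `efficiency` parameter is a hypothetical addition for modeling less-than-perfect cycles and does not come from the text.

```python
# Illustrative sketch of exponential amplification, assuming each cycle
# multiplies the copy number by (1 + efficiency); efficiency = 1.0 is ideal.
def pcr_copies(cycles: int, start_copies: int = 1, efficiency: float = 1.0) -> float:
    """Copies of the target after a given number of cycles."""
    return start_copies * (1 + efficiency) ** cycles


def cycles_needed(fold: float, efficiency: float = 1.0) -> int:
    """Smallest number of cycles giving at least the requested amplification."""
    cycles, copies = 0, 1.0
    while copies < fold:
        copies *= 1 + efficiency
        cycles += 1
    return cycles


print(pcr_copies(20))
print(cycles_needed(1e8))
```

Twenty ideal cycles give 2^20, or 1,048,576 copies per starting molecule, which matches the millionfold amplification quoted above. Under the same assumption, amplification by hundreds of millions requires 27 or more cycles.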


[Figure: DNA Amplification Using PCR. Panel labels: target DNA; denature; hybridize primers; new DNA strands. Source: see Fig. 11.]


Current  Sequencing  Technologies 

The two basic sequencing approaches, Maxam-Gilbert and Sanger, differ primarily in the
way the nested DNA fragments are produced. Both methods work because gel electrophoresis
produces very high resolution separations of DNA molecules; even fragments
that differ in size by only a single nucleotide can be resolved. Almost all steps in these
sequencing methods are now automated. Maxam-Gilbert sequencing (also called the
chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting
in fragments of different lengths. A refinement to the Maxam-Gilbert method known as
multiplex sequencing enables investigators to analyze about 40 clones on a single DNA
sequencing gel. Sanger sequencing (also called the chain termination or dideoxy method)
involves using an enzymatic procedure to synthesize DNA chains of varying length in four
different reactions, stopping the DNA replication at positions occupied by one of the four
bases, and then determining the resulting fragment lengths (Fig. 12).


These first-generation gel-based sequencing technologies are now being
used to sequence small regions of interest in the human genome. Although
investigators could use existing technology to sequence whole chromosomes,
time and cost considerations make large-scale sequencing projects of
this nature impractical. The smallest human chromosome (Y) contains 50 Mb;
the largest (chromosome 1) has 250 Mb. The largest continuous DNA
sequence obtained thus far, however, is approximately 350,000 bp, and the
best available equipment can sequence only 50,000 to 100,000 bases per
year at an approximate cost of $1 to $2 per base. At that rate, an unacceptable
30,000 work-years and at least $3 billion would be required for sequencing
alone.
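
The work-year and cost estimates can be checked with a short calculation. The genome size of about 3 billion bases is an assumption here; it is not stated in this paragraph but is consistent with the other figures quoted in this primer.

```python
# Back-of-the-envelope check of the sequencing estimates quoted above.
GENOME_BASES = 3_000_000_000  # assumption: roughly 3 billion bases


def work_years(bases: int, bases_per_year: int) -> float:
    """Years of effort at a given annual sequencing rate."""
    return bases / bases_per_year


def total_cost(bases: int, dollars_per_base: float) -> float:
    """Total cost in dollars at a given cost per base."""
    return bases * dollars_per_base


print(work_years(GENOME_BASES, 100_000))  # fastest quoted rate
print(work_years(GENOME_BASES, 50_000))   # slowest quoted rate
print(total_cost(GENOME_BASES, 1.0))      # lowest quoted cost per base
print(total_cost(GENOME_BASES, 2.0))      # highest quoted cost per base
```

At the fastest quoted rate the effort comes to 30,000 work-years, and at $1 per base the cost comes to $3 billion, which are the lower bounds given in the text. The slower rate and higher cost double both figures.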


Fig. 12. DNA Sequencing. Dideoxy sequencing (also called chain-termination or
Sanger method) uses an enzymatic procedure to synthesize DNA chains of varying
lengths, stopping DNA replication at one of the four bases and then determining the
resulting fragment lengths. Each sequencing reaction tube (T, C, G, and A) in the
diagram contains

• a DNA template, a primer sequence, and a DNA polymerase to initiate synthesis of a
new strand of DNA at the point where the primer is hybridized to the template;

• the four deoxynucleotide triphosphates (dATP, dTTP, dCTP, and dGTP) to extend
the DNA strand;

• one labeled deoxynucleotide triphosphate (using a radioactive element or dye); and

• one dideoxynucleotide triphosphate, which terminates the growing chain wherever it
is incorporated. Tube A has didATP, tube C has didCTP, etc.

For example, in the A reaction tube the ratio of the dATP to didATP is adjusted so that
each tube will have a collection of DNA fragments with a didATP incorporated for each
adenine position on the template DNA fragments. The fragments of varying length are
then separated by electrophoresis (1) and the positions of the nucleotides analyzed to
determine sequence. The fragments are separated on the basis of size, with the shorter
fragments moving faster and appearing at the bottom of the gel. Sequence is read from
bottom to top (2). (Source: see Fig. 11.)
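
The logic of reading a sequencing gel can be sketched in Python. This is an idealized model, not a simulation of the chemistry: it assumes every possible terminated fragment is present and perfectly resolved, and the function names are hypothetical. The example uses the first ten bases of the partial sequence shown with Fig. 11.

```python
# Idealized sketch of chain-termination sequencing. Each tube collects the
# lengths of all fragments that end in its dideoxy base; the gel is then
# read from the shortest fragment (bottom) to the longest (top).
def chain_termination_fragments(new_strand: str) -> dict[str, list[int]]:
    """Map each tube (A, C, G, T) to the lengths of its terminated fragments."""
    tubes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes: dict[str, list[int]]) -> str:
    """Recover the sequence by ordering all bands by fragment length."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


strand = "GGCACTGACT"
tubes = chain_termination_fragments(strand)
print(tubes["A"])
print(read_gel(tubes))
```

Because each fragment length occurs in exactly one tube, sorting the bands by length across the four lanes reproduces the order of bases in the newly synthesized strand.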


[Fig. 12 panel label: 1. Sequencing reactions loaded ... fragment separation]


Sequencing Technologies Under Development

A major focus of the Human Genome Project is the development of automated sequencing
technology that can accurately sequence 100,000 or more bases per day at a cost of
less than $0.50 per base. Specific goals include the development of sequencing and
detection schemes that are faster and more sensitive, accurate, and economical. Many
novel sequencing technologies are now being explored, and the most promising ones will
eventually be optimized for widespread use.

Second-generation (interim) sequencing technologies will enable speed and accuracy to
increase by an order of magnitude (i.e., 10 times greater) while lowering the cost per base.
Some important disease genes will be sequenced with such technologies as (1) high-voltage
capillary and ultrathin electrophoresis to increase fragment separation rate and
(2) use of resonance ionization spectroscopy to detect stable isotope labels.

Third-generation gel-less sequencing technologies, which aim to increase efficiency by
several orders of magnitude, are expected to be used for sequencing most of the human
genome. These developing technologies include (1) enhanced fluorescence detection
of individual labeled bases in flow cytometry, (2) direct reading of the base sequence
on a DNA strand with the use of scanning tunneling or atomic force microscopies,
(3) enhanced mass spectrometric analysis of DNA sequence, and (4) sequencing by
hybridization to short panels of nucleotides of known sequence. Pilot large-scale
sequencing projects will provide opportunities to improve current technologies and will
reveal challenges investigators may encounter in larger-scale efforts.

Partial  Sequencing  To  Facilitate  Mapping,  Gene 
Identification 

Correlating  mapping  data  from  different  laboratories  has  been  a  problem  because  of 
differences  in  generating,  isolating,  and  mapping  DNA  fragments.  A  common  reference 
system  designed  to  meet  these  challenges  uses  partially  sequenced  unique  regions  (200 
to  500  bp)  to  identify  clones,  contigs,  and  long  stretches  of  sequence.  Called  sequence
tagged  sites  (STSs),  these  short  sequences  have  become  standard  markers  for  physical 
mapping. 

Because coding sequences of genes represent most of the potentially useful information
content of the genome (but are only a fraction of the total DNA), some investigators have
begun partial sequencing of cDNAs instead of random genomic DNA. (cDNAs are derived
from mRNA sequences, which are the transcription products of expressed genes.) In addition
to providing unique markers, these partial sequences [termed expressed sequence
tags (ESTs)] also identify expressed genes. This strategy can thus provide a means of
rapidly identifying most human genes. Other applications of the EST approach include
determining locations of genes along chromosomes and identifying coding regions in
genomic sequences.

End Games: Completing Maps and
Sequences; Finding Specific Genes

Starting maps and sequences is relatively simple; finishing them will require new
strategies or a combination of existing methods. After a sequence is determined using the
methods described above, the task remains to fill in the many large gaps left by current
mapping methods. One approach is single-chromosome microdissection, in which a piece
is physically cut from a chromosomal region of particular interest, broken up into smaller
pieces, and amplified by PCR or cloning (see DNA Amplification). These fragments can
then be mapped and sequenced by the methods previously described.

Chromosome walking, one strategy for filling in gaps, involves hybridizing a primer of
known sequence to a clone from an unordered genomic library and synthesizing a short
complementary strand (called "walking" along a chromosome). The complementary strand
is then sequenced and its end used as the next primer for further walking; in this way the
adjacent, previously unknown, region is identified and sequenced. The chromosome is
thus systematically sequenced from one end to the other. Because primers must be
synthesized chemically, a disadvantage of this technique is the large number of different
primers needed to walk a long distance. Chromosome walking is also used to locate
specific genes by sequencing the chromosomal segments between markers that flank the
gene of interest (Fig. 13).
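
The walking procedure, and the reason it consumes so many primers, can be sketched as a loop. This is a simplified model with hypothetical names and parameters: the template is treated as a single string, each step reads a fixed number of bases past the primer, and the end of the sequence read so far becomes the next primer. Strand complementarity is ignored for brevity.

```python
# Simplified sketch of primer walking along a template of unknown sequence.
# read_length and primer_length are hypothetical illustration values.
def walk(template: str, first_primer: str, read_length: int = 10,
         primer_length: int = 6) -> tuple[str, list[str]]:
    """Return the assembled sequence and the list of primers used."""
    primer, assembled = first_primer, first_primer
    primers: list[str] = []
    search_from = 0
    while True:
        # Hybridize: locate the primer on the template.
        site = template.find(primer, search_from)
        if site == -1:
            raise ValueError("primer does not hybridize to the template")
        primers.append(primer)
        # Extend and sequence the stretch just past the primer.
        start = site + len(primer)
        read = template[start:start + read_length]
        assembled += read
        if start + len(read) >= len(template):
            break
        # The end of the known sequence becomes the next primer.
        primer = assembled[-primer_length:]
        search_from = max(0, start + len(read) - primer_length)
    return assembled, primers


template = "GGCACTGACTCTCTCTGCCTATTGGTCTAT"
sequence, primers = walk(template, "GGCACT")
print(sequence == template, primers)
```

Three primers are needed to cover this 30-base example. Because each step advances by only one read length, the number of primers grows in proportion to the distance walked, which is the disadvantage noted in the text.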

The current human genetic map has about 1000 markers, or 1 marker spaced every
3 million bp; an estimated 100 genes lie between each pair of markers. Higher-resolution
genetic maps have been made in regions of particular interest. New genes can be located
by combining genetic and physical map information for a region. The genetic map basically
describes gene order. Rough information about gene location is sometimes available
also, but these data must be used with caution because recombination is not equally likely
at all places on the chromosome. Thus the genetic map, compared to the physical map,
stretches in some places and compresses in others, as though it were drawn on a rubber
band.

The degree of difficulty in finding a disease gene of interest depends largely on what
information is already known about the gene and, especially, on what kind of DNA alterations
cause the disease. Spotting the disease gene is very difficult when disease results
from a single altered DNA base; sickle cell anemia is an example of such a case, as are
probably most major human inherited diseases. When disease results from a large DNA
rearrangement, this anomaly can usually be detected as alterations in the physical map of
the region or even by direct microscopic examination of the chromosome. The location of
these alterations pinpoints the site of the gene.

Identifying the gene responsible for a specific disease without a map is analogous to
finding a needle in a haystack. Actually, finding the gene is even more difficult, because
even close up, the gene still looks like just another piece of hay. However, maps give
clues on where to look; the finer the map's resolution, the fewer pieces of hay to be tested.

Once the neighborhood of a gene of interest has been identified, several strategies can be
used to find the gene itself. An ordered library of the gene neighborhood can be constructed
if one is not already available. This library provides DNA fragments that can be
screened for additional polymorphisms, improving the genetic map of the region and
further restricting the possible gene location. In addition, DNA fragments from the region
can be used as probes to search for DNA sequences that are expressed (transcribed to
RNA) or conserved among individuals. Most genes will have such sequences. Then
individual gene candidates must be examined. For example, a gene responsible for liver
disease is likely to be expressed in the liver and less likely in other tissues or organs. This
type of evidence can further limit the search. Finally, a suspected gene may need to be
sequenced in both healthy and affected individuals. A consistent pattern of DNA variation
when these two samples are compared will show that the gene of interest has very likely
been found. The ultimate proof is to correct the suspected DNA alteration in a cell and
show that the cell's behavior reverts to normal.


Fig. 13. Cloning a Disease Gene by Chromosome Walking. After a marker is linked to
within 1 cM of a disease gene, chromosome walking can be used to clone the disease gene
itself. A probe is first constructed from a genomic fragment identified from a library as
being the closest linked marker to the gene. A restriction fragment isolated from the end of
the clone near the disease locus is used to reprobe the genomic library for an overlapping
clone. This process is repeated several times to walk across the chromosome and reach the
flanking marker on the other side of the disease-gene locus. (Source: see Fig. 11.)


Model Organism Research


Most  mapping  and  sequencing  technologies  were  developed  from  studies  of  nonhuman 
genomes,  notably  those  of  the  bacterium  Escherichia  coli,  the  yeast  Saccharomyces 
cerevisiae,  the  fruit  fly  Drosophila  melanogaster,  the  roundworm  Caenorhabditis  elegans, 
and  the  laboratory  mouse  Mus  musculus.  These  simpler  systems  provide  excellent 
models  for  developing  and  testing  the  procedures  needed  for  studying  the  much  more 
complex  human  genome. 

A  large  amount  of  genetic  information  has  already  been  derived  from  these  organisms, 
providing  valuable  data  for  the  analysis  of  normal  gene  regulation,  genetic  diseases,  and 
evolutionary  processes.  Physical  maps  have  been  completed  for  E  coli,  and  extensive 
overlapping  clone  sets  are  available  for  S.  cerevisiae  &r\6  C.  elegans.  In  addition, 
sequencing  projects  have  been  initiated  by  the  NIH  genome  program  for  E.  coli, 
S.  cerevisiae,  and  C.  elegans. 

Mouse  genome  research  will  provide  much  significant  comparative  information  because  of 
the  many  biological  and  genetic  similarities  between  mouse  and  man.  Comparisons  of 
human  and  mouse  DNA  sequences  will  reveal  areas  that  have  been  conserved  during 
evolution  and  are  therefore  important.  An  extensive  database  of  mouse  DNA  sequences 
will  allow  counterparts  of  particular  human  genes  to  be  identified  in  the  mouse  and  exten- 
sively studied.  Conversely,  information  on  genes  first  found  to  be  important  in  the  mouse 
will  lead  to  associated  human  studies.  The  mouse  genetic  map,  based  on  morphological
markers,  has  already  led  to  many  insights  into  human  biology.  Mouse  models  are  being 
developed  to  explore  the  effects  of  mutations  causing  human  diseases,  including  diabe- 
tes, muscular  dystrophy,  and  several  cancers.  A  genetic  map  based  on  DNA  markers  is 
presently  being  constructed,  and  a  physical  map  is  planned  to  allow  direct  comparison 
with  the  human  physical  map. 


Informatics:  Data  Collection  and  Interpretation 


Collecting  and  Storing  Data 

The  reference  map  and  sequence  generated  by  genome 
research  will  be  used  as  a  primary  information  source  for 
human  biology  and  medicine  far  into  the  future.  The  vast 
amount  of  data  produced  will  first  need  to  be  collected, 
stored,  and  distributed.  If  compiled  in  books,  the  data 
would  fill  an  estimated  200  volumes  the  size  of  a  Manhat- 
tan telephone  book  (at  1000  pages  each),  and  reading  it 
would  require  26  years  working  around  the  clock  (Fig. 14). 
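
The scale of these figures can be checked with a short calculation. The genome size of about 3 billion bases is an assumption consistent with other figures in this primer, and a year is taken as 365.25 days.

```python
# Rough check of the data-volume comparison, assuming a 3-billion-base genome.
GENOME_BASES = 3_000_000_000  # assumption: roughly 3 billion bases
VOLUMES = 200
PAGES_PER_VOLUME = 1000
READING_YEARS = 26


def bases_per_page(total_bases: int = GENOME_BASES) -> int:
    """Characters each page must hold to fit the sequence in the volumes."""
    return total_bases // (VOLUMES * PAGES_PER_VOLUME)


def bases_read_per_second(total_bases: int = GENOME_BASES,
                          years: int = READING_YEARS) -> float:
    """Reading speed implied by finishing in the stated number of years."""
    seconds = years * 365.25 * 24 * 3600
    return total_bases / seconds


print(bases_per_page())
print(round(bases_read_per_second(), 2))
```

Under these assumptions each page would carry 15,000 characters, and the reader would have to take in between three and four bases every second, without pause, for 26 years.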

Because handling this amount of data will require extensive use of computers, database
development will be a major focus of the Human Genome Project. The present challenge is to
improve database design, software for database access and manipulation, and data-entry
procedures to compensate for the varied computer procedures and systems used in different
laboratories. Databases need to be designed that will accurately represent map information
(linkage, STSs, physical location, disease loci) and sequences (genomic, cDNAs, proteins)
and link them to each other and to bibliographic text databases of the scientific and
medical literature.


HUMAN GENETIC DIVERSITY:
The Ultimate Human Genetic Database

• Any two individuals differ in about 3 x 10^6 bases (0.1%).

• The population is now about 5 x 10^9.

• A catalog of all sequence differences would require 15 x 10^15 entries.

• This catalog may be needed to find the rarest or most complex disease genes.
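
The figures in the human genetic diversity sidebar are related by a single multiplication. The sketch below reproduces them; the 3 billion base genome size is an assumption (it is the size implied by 3 x 10^6 differences being 0.1%), and the variable names are illustrative only.

```python
# Figures taken from the sidebar, plus one assumed genome size.
GENOME_BASES = 3 * 10**9            # assumption: about 3 billion bases
DIFFERENCES_PER_PAIR = 3 * 10**6    # bases that differ between two individuals
POPULATION = 5 * 10**9              # world population quoted in the sidebar

fraction_different = DIFFERENCES_PER_PAIR / GENOME_BASES
catalog_entries = DIFFERENCES_PER_PAIR * POPULATION

print(f"fraction of bases that differ: {fraction_different:.1%}")
print(f"catalog entries: {catalog_entries:.1e}")
```

The product, 1.5 x 10^16, is the same number the sidebar writes as 15 x 10^15. It is an upper bound in the sense that it counts every individual's differences separately, including variants shared by many people.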


Interpreting  Data 


New  tools  will  also  be  needed  for  analyzing  the  data  from  genome  maps  and  sequences. 
Recognizing where genes begin and end and identifying their exons, introns, and regulatory
sequences may require extensive comparisons with sequences from related species
such  as  the  mouse  to  search  for  conserved  similarities  (homologies).  Searching  a  data- 
base for  a  particular  DNA  sequence  may  uncover  these  homologous  sequences  in  a 
known  gene  from  a  model  organism,  revealing  insights  into  the  function  of  the  correspond- 
ing human  gene. 
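
A database search for homologous sequences can be pictured with a deliberately naive sketch: slide the query along each stored sequence, count identical positions, and report the placements that exceed a threshold. This is not how production search tools work (they use much faster heuristics and allow gaps); the function names, the 80% threshold, and the toy sequences are all hypothetical.

```python
def best_ungapped_match(query, subject):
    """Return (offset, identity) of the best ungapped placement of query in subject."""
    best_offset, best_identity = -1, 0.0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(q == s for q, s in zip(query, window))
        identity = matches / len(query)
        if identity > best_identity:
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


def search_database(query, database, min_identity=0.8):
    """Return (name, offset, identity) hits at or above min_identity, best first."""
    hits = []
    for name, subject in database.items():
        offset, identity = best_ungapped_match(query, subject)
        if identity >= min_identity:
            hits.append((name, offset, identity))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical toy sequences, not real genes.
toy_database = {
    "mouse_gene_a": "TTGACCATGGCTAAGCTTGGA",
    "yeast_gene_b": "CCCGGGTTTAAACCCGGGTTT",
}
human_fragment = "ATGGCTAAGGTTGG"
hits = search_database(human_fragment, toy_database)
print(hits)
```

Here the human fragment matches the toy mouse sequence at all but one position, so only that entry is reported; a hit of this kind is what would prompt a closer look at the known function of the model-organism gene.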

Correlating  sequence  information  with  genetic  linkage  data  and  disease  gene  research 
will  reveal  the  molecular  basis  for  human  variation.  If  a  newly  identified  gene  is  found  to 
code  for  a  flawed  protein,  the  altered  protein  must  be  compared  with  the  normal  version 
to identify the specific abnormality that causes disease. Once the error is pinpointed,
researchers must try to determine how to correct it in the human body, a task that will
require  knowledge  about  how  the  protein  functions  and  in  which  cells  it  is  active. 


Fig. 14. Magnitude of Genome Data. If the DNA sequence of the human genome were compiled in
books, the equivalent of 200 volumes the size of a Manhattan telephone book (at 1000 pages
each) would be needed to hold it all.

New data-analysis tools will be needed for understanding the information from genome maps
and sequences.

Correct protein function depends on the three-dimensional (3-D), or folded, structure the
proteins assume in biological environments; thus, understanding protein structure will be
essential in determining gene function. DNA sequences will be translated into amino acid
sequences, and researchers will try to make inferences about functions either by comparing
protein sequences with each other or by comparing their specific 3-D structures (Fig. 15).


Because  the  3-D  structure  patterns  (motifs)  that  protein 
molecules  assume  are  much  more  evolutionarily  con- 
served than  amino  acid  sequences,  this  type  of  homology 
search  could  prove  more  fruitful.  Particular  motifs  may 
serve  similar  functions  in  several  different  proteins,  infor- 
mation that  would  be  valuable  in  genome  analyses. 

Currently,  however,  only  a  few  protein  motifs  can  be  recognized  at  the  sequence  level. 
Continued  development  of  analytic  capabilities  to  facilitate  grouping  protein  sequences 
into  motif  families  will  make  homology  searches  more  successful. 
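
Recognizing a motif at the sequence level amounts to pattern matching over amino acid strings. The sketch below scans for one well-known sequence pattern, the N-glycosylation site written in PROSITE notation as N-{P}-[ST]-{P}, and groups sequences by whether they contain it. The function names and the toy protein sequences are hypothetical, and a pattern match alone does not show that a site is functional.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping sites be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif_sites(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# Hypothetical toy sequences, not real proteins.
toy_proteins = {
    "toy_protein_1": "MKNGTLLNPSAANATQ",
    "toy_protein_2": "MPPLLVAG",
}

sites = {name: find_motif_sites(seq) for name, seq in toy_proteins.items()}
motif_family = {name for name, found in sites.items() if found}
print(sites)
print(motif_family)
```

In the first toy sequence the pattern matches at positions 3 and 13 but not at position 8, where the proline after the asparagine rules the site out.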


Mapping Databases


The Genome Data Base (GDB), located at Johns Hopkins University (Baltimore, Maryland),
provides location, ordering, and distance information for human genetic markers,
probes,  and  contigs  linked  to  known  human  genetic  disease.  GDB  is  presently  working  on 
incorporating  physical  mapping  data.  Also  at  Hopkins  is  the  Online  Mendelian  Inheritance 
in  Man  database,  a  catalog  of  inherited  human  traits  and  diseases. 

The Human and Mouse Probes and Libraries Database (located at the American Type
Culture Collection in Rockville, Maryland) and the GBASE mouse database (located at
Jackson Laboratory, Bar Harbor, Maine) include data on RFLPs, chromosomal assignments,
and probes from the laboratory mouse.


Sequence  Databases 

Nucleic  Acids  (DNA  and  RNA) 

Public  databases  containing  the  complete  nucleotide  sequence  of  the  human  genome  and 
those  of  selected  model  organisms  will  be  one  of  the  most  useful  products  of  the  Human 
Genome  Project.  Four  major  public  databases  now  store  nucleotide  sequences:  GenBank 
and  the  Genome  Sequence  DataBase  (GSDB)  in  the  United  States,  European  Molecular 
Biology Laboratory (EMBL) Nucleotide Sequence Database in the United Kingdom, and
the DNA Database of Japan (DDBJ). The databases collaborate to share sequences,
which are compiled from direct author submissions and journal scans. The four databases
now house a total of almost 200 Mb of sequence. Although human sequences predominate,
more than 8000 species are represented. [Paragraph updated July 1994]


Fig. 15. Understanding Gene Function. Understanding how genes function will require
analyses of the 3-D structures of the proteins for which the genes code.


Proteins

The  major  protein  sequence  databases  are  the  Protein  Identification  Resource  (National 
Biomedical Research Foundation), SWISS-PROT, and GenPept (both distributed with
GenBank).  In  addition  to  sequence  information,  they  contain  information  on  protein  motifs 
and  other  features  of  protein  structure. 


Impact  of  the  Human  Genome  Project 

The  atlas  of  the  human  genome  will  revolutionize  medical  practice  and  biological 
research  into  the  21st  century  and  beyond.  All  human  genes  will  eventually  be  found,  and 
accurate  diagnostics  will  be  developed  for  most  inherited  diseases.  In  addition,  animal 
models  for  human  disease  research  will  be  more  easily  developed,  facilitating  the  under- 
standing of  gene  function  in  health  and  disease. 

Researchers  have  already  identified  single  genes  associated  with  a  number  of  diseases, 
such  as  cystic  fibrosis,  Duchenne  muscular  dystrophy,  myotonic  dystrophy,  neurofibroma- 
tosis, and  retinoblastoma.  As  research  progresses,  investigators  will  also  uncover  the 
mechanisms  for  diseases  caused  by  several  genes  or  by  a  gene  interacting  with  environ- 
mental factors.  Genetic  susceptibilities  have  been  implicated  in  many  major  disabling  and 
fatal  diseases  including  heart  disease,  stroke,  diabetes,  and  several  kinds  of  cancer.  The 
identification  of  these  genes  and  their  proteins  will  pave  the  way  to  more-effective 
therapies and preventive measures. Investigators determining the underlying biology of
genome  organization  and  gene  regulation  will  also  begin  to  understand  how  humans 
develop  from  single  cells  to  adults,  why  this  process  sometimes  goes  awry,  and  what 
changes  take  place  as  people  age. 

New  technologies  developed  for  genome  research  will  also  find  myriad  applications  in 
industry,  as  well  as  in  projects  to  map  (and  ultimately  improve)  the  genomes  of  economi- 
cally important  farm  animals  and  crops. 

While  human  genome  research  itself  does  not  pose  any  new  ethical  dilemmas,  the  use  of 
data  arising  from  these  studies  presents  challenges  that  need  to  be  addressed  before  the 
data  accumulate  significantly.  To  assist  in  policy  development,  the  ethics  component  of 
the  Human  Genome  Project  is  funding  conferences  and  research  projects  to  identify  and 
consider relevant issues, as well as activities to promote public awareness of these topics.


Portions of the
glossary  text  were 
taken  directly  or 
modified  from  defini- 
tions in  the  U.S. 
Congress  Office  of 
Technology  Assess- 
ment document: 
Mapping  Our 
Genes — The  Genome 
Projects:  How  Big, 
How  Fast?  OTA-BA- 
373,  Washington, 
D.C.:  U.S.  Govern- 
ment Printing  Office, 
April  1988. 


Adenine (A): A nitrogenous base, one member of the base pair A-T (adenine-thymine).

Alleles: Alternative forms of a genetic locus; a single allele for each locus is inherited
separately from each parent (e.g., at a locus for eye color the allele might result in blue or
brown eyes).

Amino acid: Any of a class of 20 molecules that are combined to form proteins in living
things.  The  sequence  of  amino  acids  in  a  protein  and  hence  protein  function  are  deter- 
mined by  the  genetic  code. 

Amplification:  An  increase  in  the  number  of  copies  of  a  specific  DNA  fragment;  can  be  in 
vivo or in vitro. See cloning, polymerase chain reaction.

Arrayed library: Individual primary recombinant clones (hosted in phage, cosmid, YAC,
or other vector) that are placed in two-dimensional arrays in microtiter dishes. Each
primary clone can be identified by the identity of the plate and the clone location (row and
column) on that plate. Arrayed libraries of clones can be used for many applications,
including screening for a specific gene or genomic region of interest as well as for physical
mapping. Information gathered on individual clones from various genetic linkage and
physical map analyses is entered into a relational database and used to construct physical
and genetic linkage maps simultaneously; clone identifiers serve to interrelate the
multilevel maps. Compare library, genomic library.

Autoradiography: A technique that uses X-ray film to visualize radioactively labeled
molecules or fragments of molecules; used in analyzing length and number of DNA
fragments  after  they  are  separated  by  gel  electrophoresis. 

Autosome: A chromosome not involved in sex determination. The diploid human genome
consists of 46 chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes (the
X and Y chromosomes).

Bacteriophage:  See  phage. 

Base  pair  (bp):  Two  nitrogenous  bases  (adenine  and  thymine  or  guanine  and  cytosine) 
held together by weak bonds. Two strands of DNA are held together in the shape of a
double  helix  by  the  bonds  between  base  pairs. 

Base  sequence:  The  order  of  nucleotide  bases  in  a  DNA  molecule. 

Base  sequence  analysis:  A  method,  sometimes  automated,  for  determining  the  base 
sequence. 

Biotechnology: A set of biological techniques developed through basic research and now
applied to research and product development. In particular, the use by industry of
recombinant DNA, cell fusion, and new bioprocessing techniques.

bp: See base pair.

cDNA: See complementary DNA.

Centimorgan (cM): A unit of measure of recombination frequency. One centimorgan is
equal to a 1% chance that a marker at one genetic locus will be separated from a marker
at a second locus due to crossing over in a single generation. In human beings, 1 centimorgan
is equivalent, on average, to 1 million base pairs.
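
As a rough illustration of the centimorgan definition, the sketch below converts observed recombination counts into centimorgans and then into an approximate physical distance using the average of 1 million base pairs per centimorgan quoted in the entry. The function names and sample counts are hypothetical, and the simple percentage rule is only a reasonable approximation for closely spaced markers.

```python
BASE_PAIRS_PER_CM = 1_000_000  # human average quoted in the glossary entry


def map_distance_cm(recombinant_offspring, total_offspring):
    """Recombination frequency expressed in centimorgans (1 cM = 1% recombination)."""
    return 100.0 * recombinant_offspring / total_offspring


def approximate_base_pairs(distance_cm):
    """Very rough physical distance implied by a genetic distance."""
    return distance_cm * BASE_PAIRS_PER_CM


# Hypothetical counts: 4 recombinant offspring out of 200 scored.
distance = map_distance_cm(recombinant_offspring=4, total_offspring=200)
physical = approximate_base_pairs(distance)
print(distance, "cM, roughly", int(physical), "base pairs")
```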

Centromere: A specialized chromosome region to which spindle fibers attach during cell
division. 

Chromosomes: The self-replicating genetic structures of cells containing the cellular
DNA that bears in its nucleotide sequence the linear array of genes. In prokaryotes,
chromosomal  DNA  is  circular,  and  the  entire  genome  is  carried  on  one  chromosome. 
Eukaryotic  genomes  consist  of  a  number  of  chromosomes  whose  DNA  is  associated  with 
different  kinds  of  proteins. 

Clone  bank:  See  genomic  library. 

Clones:  A  group  of  cells  derived  from  a  single  ancestor. 

Cloning:  The  process  of  asexually  producing  a  group  of  cells  (clones),  all  genetically 
identical,  from  a  single  ancestor.  In  recombinant  DNA  technology,  the  use  of  DNA  ma- 
nipulation procedures  to  produce  multiple  copies  of  a  single  gene  or  segment  of  DNA  is 
referred  to  as  cloning  DNA. 

Cloning  vector:  DNA  molecule  originating  from  a  virus,  a  plasmid,  or  the  cell  of  a  higher 
organism  into  which  another  DNA  fragment  of  appropriate  size  can  be  integrated  without 
loss  of  the  vector's  capacity  for  self-replication;  vectors  introduce  foreign  DNA  into  host 
cells, where it can be reproduced in large quantities. Examples are plasmids, cosmids,
and yeast artificial chromosomes. Vectors are often recombinant molecules containing
DNA  sequences  from  several  sources. 

cM:  See  centimorgan. 

Code:  See  genetic  code. 

Codon:  See  genetic  code. 

Complementary  DNA  (cDNA):  DNA  that  is  synthesized  from  a  messenger  RNA  tem- 
plate; the  single-stranded  form  is  often  used  as  a  probe  in  physical  mapping. 

Complementary sequences: Nucleic acid base sequences that can form a double-stranded
structure by matching base pairs; the complementary sequence to G-T-A-C is
C-A-T-G.
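
The base-pairing rule in this entry is mechanical enough to express in a few lines. The sketch below is illustrative only (the function names are hypothetical): it pairs each base with its partner, and also shows the reverse complement, which is how the partner strand reads in its own direction because the two strands of a double helix run antiparallel.

```python
# A pairs with T, and G pairs with C.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(sequence):
    """Base-by-base partner of a DNA sequence, written in the same left-to-right order."""
    return sequence.upper().translate(PAIRING)


def reverse_complement(sequence):
    """Partner strand read in its own direction."""
    return complement(sequence)[::-1]


print(complement("GTAC"))          # the glossary's example pairing
print(reverse_complement("GTAC"))
```

For the glossary's example, the complement of GTAC is CATG. Its reverse complement is GTAC again, so this particular sequence reads the same on both strands.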

Conserved sequence: A base sequence in a DNA molecule (or an amino acid sequence
in a protein) that has remained essentially unchanged throughout evolution.

Contig map: A map depicting the relative order of a linked library of small overlapping
clones representing a complete chromosomal segment.

Contigs: Groups of clones representing overlapping regions of a genome.

Cosmid: Artificially constructed cloning vector containing the cos gene of phage lambda.
Cosmids can be packaged in lambda phage particles for infection into E. coli; this permits
cloning  of  larger  DNA  fragments  (up  to  45  kb)  than  can  be  introduced  into  bacterial  hosts 
in  plasmid  vectors. 

Crossing over: The breaking during meiosis of one maternal and one paternal chromosome,
the exchange of corresponding sections of DNA, and the rejoining of the chromosomes.
This process can result in an exchange of alleles between chromosomes. Compare
recombination.

Cytosine (C): A nitrogenous base, one member of the base pair G-C (guanine and
cytosine). 

Deoxyribonucleotide: See nucleotide.

Diploid:  A  full  set  of  genetic  material,  consisting  of  paired  chromosomes — one  chromo- 
some from  each  parental  set.  Most  animal  cells  except  the  gametes  have  a  diploid  set  of 
chromosomes.  The  diploid  human  genome  has  46  chromosomes.  Compare  haploid. 

DNA  (deoxyribonucleic  acid):  The  molecule  that  encodes  genetic  information.  DNA  is  a 
double-stranded molecule held together by weak bonds between base pairs of nucleotides.
The four nucleotides in DNA contain the bases: adenine (A), guanine (G), cytosine
(C),  and  thymine  (T).  In  nature,  base  pairs  form  only  between  A  and  T  and  between  G  and 
C;  thus  the  base  sequence  of  each  single  strand  can  be  deduced  from  that  of  its  partner. 

DNA  probes:  See  probe. 

DNA  replication:  The  use  of  existing  DNA  as  a  template  for  the  synthesis  of  new  DNA 
strands.  In  humans  and  other  eukaryotes,  replication  occurs  in  the  cell  nucleus. 

DNA  sequence:  The  relative  order  of  base  pairs,  whether  in  a  fragment  of  DNA,  a  gene, 
a  chromosome,  or  an  entire  genome.  See  base  sequence  analysis. 

Domain: A discrete portion of a protein with its own function. The combination of domains
in  a  single  protein  determines  its  overall  function. 

Double  helix:  The  shape  that  two  linear  strands  of  DNA  assume  when  bonded  together. 

E. coli: Common bacterium that has been studied intensively by geneticists because of its
small  genome  size,  normal  lack  of  pathogenicity,  and  ease  of  growth  in  the  laboratory. 

Electrophoresis: A method of separating large molecules (such as DNA fragments or
proteins) from a mixture of similar molecules. An electric current is passed through a
medium containing the mixture, and each kind of molecule travels through the medium at
a different rate, depending on its electrical charge and size. Separation is based on these
differences. Agarose and acrylamide gels are the media commonly used for electrophoresis
of proteins and nucleic acids.

Endonuclease: An enzyme that cleaves its nucleic acid substrate at internal sites in the
nucleotide sequence.

Enzyme: A protein that acts as a catalyst, speeding the rate at which a biochemical
reaction  proceeds  but  not  altering  the  direction  or  nature  of  the  reaction. 

EST:  Expressed  sequence  tag.  See  sequence  tagged  site. 

Eukaryote: Cell or organism with membrane-bound, structurally discrete nucleus and
other well-developed subcellular compartments. Eukaryotes include all organisms except
viruses, bacteria, and blue-green algae. Compare prokaryote. See chromosomes.

Evolutionarily conserved: See conserved sequence.

Exogenous  DNA:  DNA  originating  outside  an  organism. 

Exons: The protein-coding DNA sequences of a gene. Compare introns.

Exonuclease:  An  enzyme  that  cleaves  nucleotides  sequentially  from  free  ends  of  a  linear 
nucleic  acid  substrate. 

Expressed  gene:  See  gene  expression. 

FISH (fluorescence in situ hybridization): A physical mapping approach that uses
fluorescein tags to detect hybridization of probes with metaphase chromosomes and with
the less-condensed somatic interphase chromatin.

Flow cytometry: Analysis of biological material by detection of the light-absorbing or
fluorescing properties of cells or subcellular fractions (i.e., chromosomes) passing in a
narrow  stream  through  a  laser  beam.  An  absorbance  or  fluorescence  profile  of  the  sample 
is  produced.  Automated  sorting  devices,  used  to  fractionate  samples,  sort  successive 
droplets  of  the  analyzed  stream  into  different  fractions  depending  on  the  fluorescence 
emitted  by  each  droplet. 

Flow  karyotyping:  Use  of  flow  cytometry  to  analyze  and/or  separate  chromosomes  on 
the  basis  of  their  DNA  content. 

Gamete: Mature male or female reproductive cell (sperm or ovum) with a haploid set of
chromosomes  (23  for  humans). 

Gene:  The  fundamental  physical  and  functional  unit  of  heredity.  A  gene  is  an  ordered 
sequence  of  nucleotides  located  in  a  particular  position  on  a  particular  chromosome  that 
encodes a specific functional product (i.e., a protein or RNA molecule). See gene
expression.

Gene  expression:  The  process  by  which  a  gene's  coded  information  is  converted  into  the 
structures  present  and  operating  in  the  cell.  Expressed  genes  include  those  that  are 
transcribed  into  mRNA  and  then  translated  into  protein  and  those  that  are  transcribed  into 
RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

Gene families: Groups of closely related genes that make similar products.

Gene  library:  See  genomic  library. 

Gene mapping: Determination of the relative positions of genes on a DNA molecule
(chromosome or plasmid) and of the distance, in linkage units or physical units, between
them. 

Gene product: The biochemical material, either RNA or protein, resulting from expression
of  a  gene.  The  amount  of  gene  product  is  used  to  measure  how  active  a  gene  is;  abnor- 
mal amounts  can  be  correlated  with  disease-causing  alleles. 

Genetic code: The sequence of nucleotides, coded in triplets (codons) along the mRNA,
that determines the sequence of amino acids in protein synthesis. The DNA sequence of
a gene can be used to predict the mRNA sequence, and the genetic code can in turn be
used to predict the amino acid sequence.
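
The two prediction steps named in this entry, DNA to mRNA and mRNA to amino acids, can be sketched as follows. The codon table is the standard genetic code written in the conventional U, C, A, G ordering with one-letter amino acid symbols ("*" marks a stop codon). The function names and the short example gene are hypothetical, and the sketch assumes the input is the coding strand, starts in frame, and contains no introns.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UUU, UUC, UUA, UUG, UCU, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Predict the mRNA from the coding strand of a gene (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons in triplets from the first base until a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_gene = "ATGGCCAAGTAA"  # hypothetical example, not a real gene
mrna = transcribe(toy_gene)
print(mrna, translate(mrna))
```

The toy gene is read as the codons AUG, GCC, AAG, and UAA, giving methionine, alanine, and lysine followed by a stop.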

Genetic  engineering  technologies:  See  recombinant  DNA  technologies. 

Genetic  map:  See  linkage  map. 

Genetic  material:  See  genome. 

Genetics:  The  study  of  the  patterns  of  inheritance  of  specific  traits. 

Genome:  All  the  genetic  material  in  the  chromosomes  of  a  particular  organism;  its  size  is 
generally given as its total number of base pairs.

Genome projects: Research and technology development efforts aimed at mapping and
sequencing  some  or  all  of  the  genome  of  human  beings  and  other  organisms. 

Genomic library: A collection of clones made from a set of randomly generated overlapping
DNA fragments representing the entire genome of an organism. Compare library,
arrayed  library. 

Guanine  (G):  A  nitrogenous  base,  one  member  of  the  base  pair  G-C  (guanine  and 
cytosine).

Haploid: A single set of chromosomes (half the full set of genetic material), present in the
egg and sperm cells of animals and in the egg and pollen cells of plants. Human beings
have 23 chromosomes in their reproductive cells. Compare diploid.

Heterozygosity:  The  presence  of  different  alleles  at  one  or  more  loci  on  homologous 
chromosomes. 

Homeobox: A short stretch of nucleotides whose base sequence is virtually identical in
all the genes that contain it. It has been found in many organisms from fruit flies to human
beings. In the fruit fly, a homeobox appears to determine when particular groups of genes
are expressed during development.

Homologies: Similarities in DNA or protein sequences between individuals of the same
species  or  among  different  species. 

Homologous chromosomes: A pair of chromosomes containing the same linear gene
sequences,  each  derived  from  one  parent. 

Human  gene  therapy:  Insertion  of  normal  DNA  directly  into  cells  to  correct  a  genetic 
defect. 

Human Genome Initiative: Collective name for several projects begun in 1986 by DOE
to  (1)  create  an  ordered  set  of  DNA  segments  from  known  chromosomal  locations,  (2) 
develop  new  computational  methods  for  analyzing  genetic  map  and  DNA  sequence  data, 
and  (3)  develop  new  techniques  and  instruments  for  detecting  and  analyzing  DNA.  This 
DOE initiative is now known as the Human Genome Program. The national effort, led by
DOE and NIH, is known as the Human Genome Project.

Hybridization: The process of joining two complementary strands of DNA or one each of
DNA  and  RNA  to  form  a  double-stranded  molecule. 

Informatics:  The  study  of  the  application  of  computer  and  statistical  techniques  to  the 
management of information. In genome projects, informatics includes the development of
methods  to  search  databases  quickly,  to  analyze  DNA  sequence  information,  and  to 
predict  protein  sequence  and  structure  from  DNA  sequence  data. 

In situ hybridization: Use of a DNA or RNA probe to detect the presence of the complementary
DNA sequence in cloned bacterial or cultured eukaryotic cells.

Interphase:  The  period  in  the  cell  cycle  when  DNA  is  replicated  in  the  nucleus;  followed 
by  mitosis. 

Introns: The DNA base sequences interrupting the protein-coding sequences of a gene;
these sequences are transcribed into RNA but are cut out of the message before it is
translated into protein. Compare exons.

In  vitro:  Outside  a  living  organism. 

Karyotype: A photomicrograph of an individual's chromosomes arranged in a standard
format showing the number, size, and shape of each chromosome type; used in low-resolution
physical mapping to correlate gross chromosomal abnormalities with the
characteristics  of  specific  diseases. 

kb:  See  kilobase. 

Kilobase (kb): Unit of length for DNA fragments equal to 1000 nucleotides.

Library:  An  unordered  collection  of  clones  (i.e.,  cloned  DNA  from  a  particular  organism), 
whose  relationship  to  each  other  can  be  established  by  physical  mapping.  Compare 
genomic library, arrayed library.

Linkage: The proximity of two or more markers (e.g., genes, RFLP markers) on a chromosome;
the closer together the markers are, the lower the probability that they will be
separated during DNA repair or replication processes (binary fission in prokaryotes,
mitosis or meiosis in eukaryotes), and hence the greater the probability that they will be
inherited together.

Linkage map: A map of the relative positions of genetic loci on a chromosome, determined
on the basis of how often the loci are inherited together. Distance is measured in
centimorgans (cM).

Localize: Determination of the original position (locus) of a gene or other marker on a
chromosome. 

Locus (pl. loci): The position on a chromosome of a gene or other chromosome marker;
also, the DNA at that position. The use of locus is sometimes restricted to mean regions
of DNA that are expressed. See gene expression.

Macrorestriction map: Map depicting the order of and distance between sites at which
restriction enzymes cleave chromosomes.

Mapping:  See  gene  mapping,  linkage  map,  physical  map. 

Marker: An identifiable physical location on a chromosome (e.g., restriction enzyme
cutting site, gene) whose inheritance can be monitored. Markers can be expressed
regions of DNA (genes) or some segment of DNA with no known coding function but
whose pattern of inheritance can be determined. See RFLP, restriction fragment length
polymorphism.

Mb:  See  megabase. 

Megabase  (Mb):  Unit  of  length  for  DNA  fragments  equal  to  1  million  nucleotides  and 
roughly  equal  to  1  cM. 

Meiosis: The process of two consecutive cell divisions in the diploid progenitors of sex
cells. Meiosis results in four rather than two daughter cells, each with a haploid set of
chromosomes.

Messenger  RNA  (mRNA):  RNA  that  serves  as  a  template  for  protein  synthesis.  See 
genetic  code. 

Metaphase: A stage in mitosis or meiosis during which the chromosomes are aligned
along the equatorial plane of the cell.

Mitosis: The process of nuclear division in cells that produces daughter cells that are
genetically identical to each other and to the parent cell.

mRNA:  See  messenger  RNA. 

Multifactorial or multigenic disorders: See polygenic disorders.

Multiplexing: A sequencing approach that uses several pooled samples simultaneously,
greatly  increasing  sequencing  speed. 

Mutation:  Any  heritable  change  in  DNA  sequence.  Compare  polymorphism. 

Nitrogenous  base:  A  nitrogen-containing  molecule  having  the  chemical  properties  of  a 
base. 

Nucleic  acid:  A  large  molecule  composed  of  nucleotide  subunits. 

Nucleotide: A subunit of DNA or RNA consisting of a nitrogenous base (adenine, guanine,
thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate
molecule, and a sugar molecule (deoxyribose in DNA and ribose in RNA). Thousands
of nucleotides are linked to form a DNA or RNA molecule. See DNA, base pair, RNA.

Nucleus: The cellular organelle in eukaryotes that contains the genetic material.

Oncogene: A gene, one or more forms of which is associated with cancer. Many
oncogenes are involved, directly or indirectly, in controlling the rate of cell growth.

Overlapping  clones:  See  genomic  library. 

PCR:  See  polymerase  chain  reaction. 

Phage: A virus for which the natural host is a bacterial cell.

Physical map: A map of the locations of identifiable landmarks on DNA (e.g., restriction
enzyme cutting sites, genes), regardless of inheritance. Distance is measured in base
pairs. For the human genome, the lowest-resolution physical map is the banding patterns
on the 24 different chromosomes; the highest-resolution map would be the complete
nucleotide  sequence  of  the  chromosomes. 

Plasmid:  Autonomously  replicating,  extrachromosomal  circular  DNA  molecules,  distinct 
from  the  normal  bacterial  genome  and  nonessential  for  cell  survival  under  nonselective 
conditions.  Some  plasmids  are  capable  of  integrating  into  the  host  genome.  A  number  of 
artificially  constructed  plasmids  are  used  as  cloning  vectors. 

Polygenic disorders: Genetic disorders resulting from the combined action of alleles of
more  than  one  gene  (e.g.,  heart  disease,  diabetes,  and  some  cancers).  Although  such 
disorders  are  inherited,  they  depend  on  the  simultaneous  presence  of  several  alleles;  thus 
the  hereditary  patterns  are  usually  more  complex  than  those  of  single-gene  disorders. 
Compare  single-gene  disorders. 

Polymerase chain reaction (PCR): A method for amplifying a DNA base sequence using
a heat-stable polymerase and two 20-base primers, one complementary to the (+)-strand
at one end of the sequence to be amplified and the other complementary to the (-)-strand
at the other end. Because the newly synthesized DNA strands can subsequently serve
as additional templates for the same primer sequences, successive rounds of primer
annealing, strand elongation, and dissociation produce rapid and highly specific
amplification of the desired sequence. PCR also can be used to detect the existence of the
defined sequence in a DNA sample.
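
Because each round of PCR lets newly made strands serve as templates, the number of copies grows geometrically. The sketch below models this under idealized assumptions: every template is copied with the same per-cycle efficiency and nothing limits the reaction. The function name and the efficiency parameter are illustrative assumptions, not part of the glossary definition.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    efficiency is the fraction of templates copied in each cycle
    (1.0 means perfect doubling; real reactions fall short of this).
    """
    return initial_copies * (1.0 + efficiency) ** cycles


for cycles in (10, 20, 30):
    print(cycles, "cycles:", f"{pcr_copies(1, cycles):.3g}", "copies per starting molecule")
```

With perfect doubling, 30 cycles would yield 2^30 copies, roughly a billion per starting molecule, which is why the method can detect a sequence present in very small amounts.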

Polymerase, DNA or RNA: Enzymes that catalyze the synthesis of nucleic acids on
preexisting  nucleic  acid  templates,  assembling  RNA  from  ribonucleotides  or  DNA  from 
deoxyribonucleotides. 

Polymorphism:  Difference  in  DNA  sequence  among  individuals.  Genetic  variations 
occurring  in  more  than  1%  of  a  population  would  be  considered  useful  polymorphisms  for 
genetic linkage analysis. Compare mutation.

Primer:  Short  preexisting  polynucleotide  chain  to  which  new  deoxyribonucleotides  can  be 
added  by  DNA  polymerase. 

Probe:  Single-stranded  DNA  or  RNA  molecules  of  specific  base  sequence,  labeled 
either radioactively or immunologically, that are used to detect the complementary base
sequence  by  hybridization. 
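
Detection by hybridization can be pictured as a search for the stretch of target DNA that is complementary to the probe. The following is a minimal sketch that assumes perfect base pairing only (no mismatches) and sequences written 5' to 3'; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def probe_binding_sites(target_strand, probe):
    """Return 0-based positions on target_strand where the probe can pair perfectly."""
    wanted = reverse_complement(probe)
    hits = []
    start = target_strand.find(wanted)
    while start != -1:
        hits.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = target_strand.find(wanted, start + 1)
    return hits


print(probe_binding_sites("GGATCCATGCAAGG", "TTGCAT"))
```

Real hybridization tolerates some mismatches depending on temperature and salt conditions, which this exact-match sketch ignores.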

Prokaryote:  Cell  or  organism  lacking  a  membrane-bound,  structurally  discrete  nucleus 
and  other  subcellular  compartments.  Bacteria  are  prokaryotes.  Compare  eukaryote.  See 
chromosomes. 

Promoter: A site on DNA to which RNA polymerase will bind and initiate transcription.

Protein:  A  large  molecule  composed  of  one  or  more  chains  of  amino  acids  in  a  specific 
order;  the  order  is  determined  by  the  base  sequence  of  nucleotides  in  the  gene  coding  for 
the protein. Proteins are required for the structure, function, and regulation of the body's
cells, tissues, and organs, and each protein has unique functions. Examples are hormones,
enzymes, and antibodies.

Purine: A nitrogen-containing, double-ring, basic compound that occurs in nucleic acids.
The purines in DNA and RNA are adenine and guanine.

Pyrimidine: A nitrogen-containing, single-ring, basic compound that occurs in nucleic
acids. The pyrimidines in DNA are cytosine and thymine; in RNA, cytosine and uracil.

Rare-cutter enzyme: See restriction enzyme cutting site.

Recombinant clones: Clones containing recombinant DNA molecules. See recombinant
DNA  technologies. 

Recombinant  DNA  molecules:  A  combination  of  DNA  molecules  of  different  origin  that 
are  joined  using  recombinant  DNA  technologies. 

Recombinant DNA technologies: Procedures used to join together DNA segments in a
cell-free system (an environment outside a cell or organism). Under appropriate conditions,
a recombinant DNA molecule can enter a cell and replicate there, either autonomously
or after it has become integrated into a cellular chromosome.

Recombination: The process by which progeny derive a combination of genes different
from  that  of  either  parent.  In  higher  organisms,  this  can  occur  by  crossing  over. 

Regulatory regions or sequences: A DNA base sequence that controls gene expression.

Resolution: Degree of molecular detail on a physical map of DNA, ranging from low to
high. 

Restriction  enzyme,  endonuclease:  A  protein  that  recognizes  specific,  short  nucleotide 
sequences  and  cuts  DNA  at  those  sites.  Bacteria  contain  over  400  such  enzymes  that 
recognize  and  cut  over  100  different  DNA  sequences.  See  restriction  enzyme  cutting  site. 

Restriction enzyme cutting site: A specific nucleotide sequence of DNA at which a
particular restriction enzyme cuts the DNA. Some sites occur frequently in DNA (e.g.,
every several hundred base pairs), others much less frequently (rare-cutter, e.g., every
10,000  base  pairs). 
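
As an illustration of how a cutting site divides DNA into fragments, the sketch below scans one strand for a recognition sequence and cuts at a fixed offset within each site. The example uses the EcoRI recognition sequence GAATTC, which is cut after the first G; the function names and the short test sequence are invented for this example, and only a single strand is modeled.

```python
def find_sites(dna, recognition_seq):
    """Return 0-based start positions of every occurrence of recognition_seq."""
    positions = []
    start = dna.find(recognition_seq)
    while start != -1:
        positions.append(start)
        start = dna.find(recognition_seq, start + 1)
    return positions


def digest(dna, recognition_seq, cut_offset):
    """Cut dna cut_offset bases into each site and return the fragments in order."""
    fragments = []
    previous_cut = 0
    for site in find_sites(dna, recognition_seq):
        cut = site + cut_offset
        fragments.append(dna[previous_cut:cut])
        previous_cut = cut
    fragments.append(dna[previous_cut:])
    return fragments


sequence = "AAGAATTCGGGGAATTCTT"  # made-up sequence with two GAATTC sites
print(find_sites(sequence, "GAATTC"))
print(digest(sequence, "GAATTC", cut_offset=1))
```

A change of a single base inside either site would remove that cut and merge two fragments into one longer fragment, which is the kind of size difference that underlies restriction fragment length polymorphisms.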

Restriction fragment length polymorphism (RFLP): Variation between individuals in
DNA fragment sizes cut by specific restriction enzymes; polymorphic sequences that
result in RFLPs are used as markers on both physical maps and genetic linkage maps.
RFLPs are usually caused by mutation at a cutting site. See marker.

RFLP:  See  restriction  fragment  length  polymorphism. 

Ribonucleic acid (RNA): A chemical found in the nucleus and cytoplasm of cells; it plays
an important role in protein synthesis and other chemical activities of the cell. The structure
of RNA is similar to that of DNA. There are several classes of RNA molecules,
including messenger RNA, transfer RNA, ribosomal RNA, and other small RNAs, each
serving a different purpose.

Ribonucleotides:  See  nucleotide. 

Ribosomal  RNA  (rRNA):  A  class  of  RNA  found  in  the  ribosomes  of  cells. 

Ribosomes: Small cellular components composed of specialized ribosomal RNA and
protein; site of protein synthesis. See ribonucleic acid (RNA).

RNA:  See  ribonucleic  acid. 

Sequence: See base sequence.

Sequence tagged site (STS): Short (200 to 500 base pairs) DNA sequence that has a
single  occurrence  in  the  human  genome  and  whose  location  and  base  sequence  are 
known.  Detectable  by  polymerase  chain  reaction,  STSs  are  useful  for  localizing  and 
orienting  the  mapping  and  sequence  data  reported  from  many  different  laboratories  and 
serve as landmarks on the developing physical map of the human genome. Expressed
sequence  tags  (ESTs)  are  STSs  derived  from  cDNAs. 

Sequencing: Determination of the order of nucleotides (base sequence) in a DNA or
RNA molecule or the order of amino acids in a protein.

Sex  chromosomes:  The  X  and  Y  chromosomes  in  human  beings  that  determine  the  sex 
of  an  individual.  Females  have  two  X  chromosomes  in  diploid  cells;  males  have  an  X  and 
a  Y  chromosome.  The  sex  chromosomes  comprise  the  23rd  chromosome  pair  in  a 
karyotype.  Compare  autosome. 

Shotgun  method:  Cloning  of  DNA  fragments  randomly  generated  from  a  genome.  See 
library,  genomic  library. 

Single-gene disorder: Hereditary disorder caused by a mutant allele of a single gene
(e.g.,  Duchenne  muscular  dystrophy,  retinoblastoma,  sickle  cell  disease).  Compare 
polygenic  disorders. 

Somatic  cells:  Any  cell  in  the  body  except  gametes  and  their  precursors. 

Southern  blotting:  Transfer  by  absorption  of  DNA  fragments  separated  in  electrophoretic 
gels to membrane filters for detection of specific base sequences by radiolabeled
complementary probes.

STS:  See  sequence  tagged  site. 

Tandem repeat sequences: Multiple copies of the same base sequence on a chromosome,
used as a marker in physical mapping.

Technology transfer: The process of converting scientific findings from research laboratories
into useful products by the commercial sector.

Telomere:  The  ends  of  chromosomes.  These  specialized  structures  are  involved  in  the 
replication  and  stability  of  linear  DNA  molecules.  See  DNA  replication. 

Thymine (T): A nitrogenous base, one member of the base pair A-T (adenine-thymine).

Transcription:  The  synthesis  of  an  RNA  copy  from  a  sequence  of  DNA  (a  gene);  the  first 
step  in  gene  expression.  Compare  translation. 
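
A simplified picture of transcription is base-by-base copying of a DNA template into complementary RNA, with uracil taking the place of thymine. The sketch below assumes the template strand is supplied in 3' to 5' order so that the RNA comes out 5' to 3'; it ignores promoters, termination signals, and RNA processing, and the function name is illustrative.

```python
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand):
    """Return the RNA complementary to a DNA template strand given 3' to 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)


print(transcribe("TACGGT"))
```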

Transfer  RNA  (tRNA):  A  class  of  RNA  having  structures  with  triplet  nucleotide  sequences 
that are complementary to the triplet nucleotide coding sequences of mRNA. The role of
tRNAs in protein synthesis is to bond with amino acids and transfer them to the ribosomes,
where proteins are assembled according to the genetic code carried by mRNA.

Transformation: A process by which the genetic material carried by an individual cell is
altered  by  incorporation  of  exogenous  DNA  into  its  genome. 

Translation:  The  process  in  which  the  genetic  code  carried  by  mRNA  directs  the  synthesis 
of  proteins  from  amino  acids.  Compare  transcription. 
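
The way the genetic code directs protein synthesis can be sketched as a lookup of successive three-base codons. The example below builds the standard genetic code table from a compact 64-letter string of one-letter amino acid codes ordered by the bases U, C, A, G, with "*" marking stop codons. As a simplification it starts reading at the first base instead of searching for a start codon, and the function name is hypothetical.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))
```

Here AUG codes for methionine (M), GCC for alanine (A), and UAA is a stop codon, so the sketch returns the two-residue chain "MA".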

tRNA:  See  transfer RNA. 

Uracil:  A  nitrogenous  base  normally  found  in  RNA  but  not  DNA;  uracil  is  capable  of 
forming a base pair with adenine.

Vector: See cloning vector.

Virus:  A  noncellular  biological  entity  that  can  reproduce  only  within  a  host  cell.  Viruses 
consist  of  nucleic  acid  covered  by  protein;  some  animal  viruses  are  also  surrounded  by 
membrane.  Inside  the  infected  cell,  the  virus  uses  the  synthetic  capability  of  the  host  to 
produce  progeny  virus. 

VLSI: Very large-scale integration allowing over 100,000 transistors on a chip.

YAC:  See  yeast  artificial  chromosome. 

Yeast  artificial  chromosome  (YAC):  A  vector  used  to  clone  DNA  fragments  (up  to  400 
kb); it is constructed from the telomeric, centromeric, and replication origin sequences
needed  for  replication  in  yeast  cells.  Compare  cloning  vector,  cosmid. 